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HIGHLY  SCAL/ 


DATA  BALANCED  . 
STRUCTURES 


' SEARCH 


The algorithms for data balancing determine how the tree nodes are assigned
to processors. Here, we develop several algorithms for data balancing, both for the
dB-tree and the dE-tree. We find that a simple distributed data balancing algorithm
works well for the dB-tree, requiring only a small space and message passing overhead.
We compare three algorithms for data balancing in a dE-tree, and find that the most
aggressive of the algorithms makes the dE-tree scalable.

We have developed two algorithms for replication, namely full replication and path
replication, and studied their performance. We have observed that path replication

We have also performed some timing experiments on our dB-tree to study the
response times and throughput of our system. The experiment was performed on 4,

performed experiments to obtain the message transmission time and processing time.

We have developed an analytical performance model of the dB-tree and the dE-tree
using the data from the simulation experiments. We then applied the model to

that our analytical model predicts a more pessimistic timing than the timing we
obtained from our experiments. Our experiments gave a value of 50 milliseconds
response time with 8 processors, whereas the model predicts 56.5 milliseconds. From
the analytical model, we observe that a distributed search structure permits a much
larger throughput than a centralized index server, at the cost of a modestly increased
response time.


CHAPTER  I 
INTRODUCTION 

1.1 Objective

The main objective of this research was to develop distributed algorithms and protocols
for some specific data structures, and implement them to study their feasibility

rithms  for  specific  purposes  and  to  study  their  correctness  criteria.  Distributed 

However, not all algorithms designed can be implemented efficiently. We intended
to study the implementation of dynamic algorithms on a network of
processors without shared memory.

2. Implementation: From the implementation view point, we were concerned with
the efficiency and performance of the algorithms. The implementation of these
distributed data structures will hide from the application's user the details of
the sites where data are stored, the access methods, and the synchronization
techniques.

For the purposes of the research we selected the B-tree for its flexibility and its
practical use in indexing large amounts of data.


Why Distributed Search Structures


Current commercial and scientific database systems deal with vast amounts of
data. Since the volume of data to be handled is so large, it may not be possible to
store all the data in one place. Also, when addressing large volumes of data, there is
the danger of memory bottlenecks. Therefore, distributed techniques are necessary to
create large-scale, efficient, distributed storage [24]. Distributed data structures allow

them among the storage sites of the system, which also allows for parallel access to the
data. Distributed data structures are useful for many distributed applications (e.g.,
in permanent information storage and retrieval techniques, global name servers in
networks, resource allocation, etc.). Although a considerable amount of research has
been done in developing parallel search structures on shared-memory multiprocessors,
little has been done on the development of search structures for distributed-memory
systems. One such search structure is the B-tree. The B-tree was selected because
of its flexibility and its practical use in indexing large amounts of data.

A distributed system is a collection of processor-memory pairs connected by an
interconnection network. Distributed systems have several advantages over centralized
systems because they enable ease of expansion, provide increased reliability, allow
actual geographic distribution, and have a higher potential for fault-tolerance and
performance due to the multiplicity of resources. Each processor-memory pair will
henceforth be called a site. Sites communicate by message passing. It is believed
that message passing multiprocessors are highly scalable. In a distributed system
no single site has complete, accurate and up-to-date information of the global state
of the system. Thus, each site must have the capability of handling inaccurate and
out-of-date information. Distributed algorithms must tolerate these inconsistencies.

1.1.3 The Need for Distributed Data Structures

The data structures used in an algorithm have a considerable effect on the efficiency
of the algorithm. Hence, for distributed algorithms, there is a need for


1. The primary reason for distributed data structures is that in a distributed
system we wish to share the data between processes on different processors. The
various parts of the distributed system share data by communication. Several
programming languages only support shared variables that allow for pseudo-

variables can be implemented by simulating shared physical memory, but this
is not sufficient for distributed systems that call for complex data structures.
Instead, there are basically three ways of providing the notion of shared data
in a distributed system:


(b)  Shared  logical  variables 

(c) Distributed objects in distributed shared memory

2. A secondary reason for distributed data structures is the problem of maintaining
large data structures at one physical location. This not only requires a large
amount of memory but also makes the system less fault tolerant. Distributing
a data structure over a large number of processors implies partitioning the
data structure into parts that are individually managed by a single processor.
The parts may be disjoint or they may be replicated to provide better
locality and increase availability. But in replication there is the problem of
inconsistency. Distributing the shared data over the different processors improves
performance, since data residing at different sites can be accessed in parallel.

Several programming languages provide the above notion of shared data, where

Data organization must be based on the principle of ubiquity, defined as:
• all data objects are accessible to all sites;
• on an access the most recent version of the data is provided;
• consistency is maintained on a global basis.

To achieve these criteria, data must be replicated and updates must be atomic.
All these criteria improve the performance and reduce the cost of access by allocating
data depending on the locality of a process. In some data structures, the access nature
is predictable, while in others it is not.

Data structures are characterized by the operations they support. A distributed
data structure consists of a set of local data structures storing the data at various
sites of the system and a set of protocols for access to the distributed data structures.
These protocols specify the query and update operations on the datum. The
distribution of the data structure is known as the data organization scheme [2] and
may be based on several criteria as follows:

1. To improve the locality of the process running on the processor

2. To reduce the message complexity of access to remote data

3. To balance the data across processors for efficient usage

4. To improve the fault tolerance and increase availability


munication cost imposed by the underlying machine architecture. Several strategies

specify the primitive operations that are to be performed on the data structures and
the mode of access by processors to the data.


1.1.4 The Need for Replication

Redundancy or replication is an inherent part of the design of distributed data
structures. Not only does replication provide fault tolerance in the event of the failure
of a processor, but it also enables dynamic data balancing and reduces costs
by placing the more often accessed data close to the processor. A process can take
advantage of its locality to reduce the cost of communication. Replication also increases
the availability of data. A factor that has to be considered is the degree of
replication, also known as replication control. In what is called the total structure,
all the data are replicated at each processor [46]. This increases the availability and
fault tolerance but places a high demand on memory requirements. A compromise is
to set up a balance between memory usage and cost considerations [2]. Replication

data  structure. 




Distributing data structures creates new issues not present in a shared memory or
a single processor system. Two basic problems are those created by the concurrency of

Concurrency issues are resolved by imposing serializability criteria. Various
serializability criteria have been studied in connection with databases [8].

The study of the distribution of data structures and the relationship to the underlying
system of processors may lead to efficient schemes for distributing the data in
terms of space, time and message complexity [2]. The complexity of data movements
is also an issue for distributed structures.

The search structure selected for this research is the B-tree. We address all the
above issues with respect to distributing a B-tree. We have selected the B-tree
because of its flexibility and its practical use in indexing large amounts of data.


1.2.1 Introduction

In this section, we present a survey of the research done in distributed data
structures. Techniques that some programming languages provide to support distributed
data structures are presented. A brief discussion of basic distributed data
structures is then presented. In the discussion of search structures that this research
concentrates on, we focus on hash tables, dictionaries and concurrent B-trees. Some
background on data balancing is also presented. Finally, the concurrent B-tree link
algorithm is presented, which forms the basis for the distributed B-tree algorithms.


1.2.2  Pr 


than the shared memory machines. This is due to the lack of a single global address

and managing communication between processes. This may reduce programmer pro-

parallel and distributed programs, in the current conventional programming languages,
each process can only access its local address space, which results in large

communication is usually more expensive than computation, it is essential that much
of the computation be done using local data. Several programming languages are
being developed to support distributed data structures. Some examples are Linda
[1], [10], Orca [2], [4] and Kali [32]. Some programming languages provide distributed
data structures explicitly, while others do so implicitly. Examine the following three:

The distributed data structure paradigm was first introduced in the language
Linda, implemented on AT&T Bell Labs' S/Net multicomputer [1]. The Tuple
Space concept is used for implementing distributed data structures. This tuple
space, consisting of tuples that are ordered sequences of values, forms a
global memory shared by all the processes in the system. To modify a tuple, a
"read, modify and write" atomic operation is needed. If two processes want access
to a tuple, only one of them succeeds while the other blocks. A distributed
array is implemented as tuples of the form <arrayname, index, value>.
The tuples are distributed across processors based on the following criteria:

(a) Either the entire tuple space is replicated; or,


(b) The last processor to create a tuple is the owner of the tuple; or,

passing and remote procedure calls are simulated using the tuple space. The
processes interact only through the tuple space. The goal of Linda is to relieve
the programmer from the task of parallel programming.
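
The tuple space idea described above can be illustrated with a small sketch. This is not Linda itself: it is a single-process Python stand-in, and the class name TupleSpace, the method names out, rd and in_, the use of None as a wildcard field, and the helper increment are all assumptions made for illustration. It shows a distributed array stored as <arrayname, index, value> tuples and a "read, modify and write" update in which a second process blocks until the first has put the tuple back.

```python
import threading


class TupleSpace:
    """Single-process stand-in for a Linda-style tuple space (illustrative only)."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        """Add a tuple to the space and wake any blocked processes."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    @staticmethod
    def _matches(pattern, tup):
        # A None field in the pattern is treated as a wildcard (an assumption).
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup))

    def _find(self, pattern):
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def rd(self, *pattern):
        """Block until a matching tuple exists; return it without removing it."""
        with self._cond:
            while (tup := self._find(pattern)) is None:
                self._cond.wait()
            return tup

    def in_(self, *pattern):
        """Block until a matching tuple exists; remove it and return it."""
        with self._cond:
            while (tup := self._find(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(tup)
            return tup


def increment(space, name, index):
    """Read-modify-write of one array element: withdraw, update, re-insert."""
    _, _, value = space.in_(name, index, None)
    space.out(name, index, value + 1)


# A four-element distributed array "A", one tuple per element.
space = TupleSpace()
for i in range(4):
    space.out("A", i, 0)

# Eight concurrent updates of A[2]; each one blocks while the tuple is withdrawn.
workers = [threading.Thread(target=increment, args=(space, "A", 2))
           for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Because a tuple is absent from the space while it is being modified, no update is lost even though the sketch uses no explicit per-element lock.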

This programming language is mainly intended for developing parallel algorithms
for distributed systems. The data structures are encapsulated into
passive objects and can be shared by different processors. The objects are
replicated on all processors and are updated by a reliable, ordered broadcast
primitive [2], [4].

3.  Kali 

A programming environment, Kali is designed to aid in the programming of
distributed memory architectures [32]. It allows the programmer to treat the
distributed data structures as single objects. A software layer supports a global
name space. Algorithms can be specified at a high level and the compiler transforms
the high level specification into a set of tasks that interact by "message
passing." Thus, the programmer is relieved of the task of programming with
low-level message passing primitives and can concentrate on pure algorithm
development. The only data type supported in this is distributed arrays.


Here, we present mechanisms of how some of the basic data structures have been


distributed.  Scalar  variables  are  usually  replicated  on  each  processor. 


Presently,  only  distributed  arrays  are  predefined  by  Kali.  However,  Kali  supports 

[32]. The clause specifies a set of distribution patterns for each dimension of the
array. An asterisk in the dimension indicates no distribution. The number of array
dimensions that are distributed cannot exceed the number of processors in the system.
Each processor stores a single copy of each array element.

arrays [1S]. This is a constraint-based approach, wherein the compiler analyzes each
loop and, based on performance considerations, identifies some constraints on the
distribution of data structures. Finally, the compiler tries to combine constraints for
each data structure so that the overall execution time of the program is minimized.
The data may have to be repartitioned between program segments and between
procedure calls. This has been implemented on the Intel iPSC/2 hypercube.


A queue is a First In, First Out (FIFO) structure that has two ends, a front and
a rear. A queue can be stored in a distributed system by storing different segments
of the queue in different sites, with each queue element being stored in exactly one

Lee et al. have presented a scheme for a fault tolerant distributed queue to provide
a high degree of availability, greater flexibility and low access cost [MJ]. In this scheme,
r replicas of the queue are made and each replica is broken into r, not necessarily

replica. There is no concurrency implemented as only one site in the entire system
is allowed to perform insertion or deletion at a time.

Priority queues have been implemented using systolic arrays. Systolic search trees
have also been used to implement multiqueues [38].


Efficient search structures are needed for maintaining files and indices in conventional
systems which have a small primary memory and a larger secondary memory.
To access individual entities of a file, an index is required. The normal operations carried
out on an index are search, insert, and delete. A search table is a data structure
in which records are organized in a well-defined manner. Search structures are used
for the implementation of dictionaries. An implementation of a search table could be
designed using either a tree, an array or a hash table on a sequential machine.

In a traditional design, an access takes a long time to complete, usually on the

as trees, sorted arrays and hash tables have been used to implement search tables.
Of these, the hash table gives the best performance, with little space overhead. Parallelism
is achieved by pipelining the accesses; however, the sequential nature of the
accesses creates a bottleneck. Therefore, a simultaneous design that accepts and
handles consecutive accesses concurrently is necessary.

Distributed memory data structures have been proposed by Ellis [14], Severance
[54], Peleg [46], Colbrook et al. [12] and Johnson and Colbrook [23]. Colbrook et
al. [12] have proposed a pipelined distributed B-tree, where each level of the tree
is maintained by a different processor. The parallelism achieved is limited by the
height of the B-tree and the processors are not data balanced. Parallel B-trees using
multi-version memory have been proposed by Wang and Weihl [60]. The algorithm

Hash  Tables 

Hashing is a well known technique for fast access to records in a large database.

ing [14], two-phase hashing [61], trie hashing [40] and linear hashing for distributed
files [41].

• Distributed Linear Hashing:

In linear hashing, the table is gradually expanded by splitting the buckets until
the table has doubled its size. Splitting means rehashing of a bucket b and

A distributed linear hashing method particularly useful for main memory databases
has been discussed by Severance and Pramanik [54]. In linear hashing, the
records are distributed into buckets that are stored on disk, but in distributed
linear hashing the buckets are stored in main memory. First a bucket is located,
and second, the record chain in the bucket is found by another computation.

By having pointers in them, the records in the bucket can be placed in any
memory module. An index is used to point to the bucket directories and is
cached in each processor. Address computation is done locally. To avoid hot
spots in accessing the central variables, a local copy is kept at each processor.
The local copies may be out-of-date at times, causing incorrect bucket address
computation. Retry logic is used to solve this problem. The paper also ad-

recovery mechanisms. The design has been implemented on a BBN Butterfly
multiprocessor system.
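
The gradual expansion used by linear hashing can be sketched as follows. This is a sequential, textbook-style illustration, not the distributed scheme of [54]: the class name LinearHashTable, the split pointer `next`, the load-factor threshold max_load and the use of Python's built-in hash are all assumptions. Buckets are split one at a time in a fixed order, and once every bucket of the current round has been split the table has doubled in size.

```python
class LinearHashTable:
    """Sequential linear hashing sketch (names and thresholds are assumptions)."""

    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets      # table size at level 0
        self.level = 0                 # number of completed doublings
        self.next = 0                  # next bucket to be split in this round
        self.max_load = max_load       # average keys per bucket that triggers a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _address(self, key):
        size = self.n0 << self.level
        b = hash(key) % size
        if b < self.next:
            # Bucket b was already split in this round: use the finer hash.
            b = hash(key) % (size << 1)
        return b

    def insert(self, key):
        self.buckets[self._address(key)].append(key)
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split()

    def _split(self):
        size = self.n0 << self.level
        b = self.next
        old = self.buckets[b]
        self.buckets[b] = []
        self.buckets.append([])        # the new bucket has index b + size
        for key in old:                # rehash bucket b over b and b + size
            self.buckets[hash(key) % (size << 1)].append(key)
        self.next += 1
        if self.next == size:          # every bucket split: the table has doubled
            self.level += 1
            self.next = 0

    def contains(self, key):
        return key in self.buckets[self._address(key)]


table = LinearHashTable()
for k in range(100):
    table.insert(k)
```

A processor that caches stale values of `level` and `next` would compute a wrong bucket address for some keys, which is the situation the retry logic mentioned above has to handle.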
• Extendible Hashing

represent the index, an extendible trie structure is used instead of a binary
tree. The index table contains 2^d positions, where d is the number of left-most
bits currently being used to address the index table. Initially, the table contains
only one position, which points to a single bucket in use. When this bucket fills,
the table is doubled in size, a new bucket is created and the keys are rearranged.
Ellis [14] has proposed a distributed extendible hashing technique. As in a
sequential system the hash structure consists of two parts: the directory component
and the buckets. It is this indirection provided by the directory that
allows the buckets to be distributed to different sites in the distributed system
and the directories to be replicated among the sites and managed by directory
managers. The buckets are linked to each other through a link field that allows
recovery from restructuring operations. The directory manager is essentially a
server capable of handling multiple requests. The bucket manager is a front-end
process that manages a disjoint set of buckets. An operation request on the
hash table is sent to any directory manager which in turn forwards the request

The directory manager is then free to accept another request. A bucket manager
on receiving a request spawns a new slave process to service the request. The
directory manager has to propagate the update information to all the other
directory managers; therefore one problem is that its failure affects the entire
system. Fault tolerance capabilities are discussed that involve more messages
in the system.
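
The directory and bucket structure described above can be sketched sequentially as follows. This is only an illustration of the 2^d-entry index table addressed by the d left-most hash bits; it does not model the directory managers, bucket managers or link fields of [14]. The hash width W, the multiplicative hash h (which assumes integer keys), the bucket capacity and all class names are assumptions.

```python
W = 16  # assumed width of the hash value in bits


def h(key):
    """Illustrative 16-bit multiplicative hash for integer keys (an assumption)."""
    return ((key * 2654435761) & 0xFFFFFFFF) >> 16


class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of left-most bits shared by all keys here
        self.keys = []


class ExtendibleHash:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.d = 0                       # left-most bits used by the directory
        self.directory = [Bucket(0)]     # always 2**d entries

    def _index(self, key):
        return h(key) >> (W - self.d)    # the d left-most bits of the hash

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        while len(bucket.keys) >= self.capacity and bucket.depth < W:
            self._split(bucket)
            bucket = self.directory[self._index(key)]
        bucket.keys.append(key)

    def _split(self, bucket):
        if bucket.depth == self.d:
            # Double the table: old entry i becomes new entries 2i and 2i + 1.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.d += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        shift = self.d - bucket.depth
        for i, b in enumerate(self.directory):
            # Entries whose newly significant bit is 1 now point to the sibling.
            if b is bucket and (i >> shift) & 1:
                self.directory[i] = sibling
        old, bucket.keys = bucket.keys, []
        for k in old:                    # rearrange the keys of the split bucket
            self.directory[self._index(k)].keys.append(k)

    def contains(self, key):
        return key in self.directory[self._index(key)].keys


eh = ExtendibleHash()
for k in range(100):
    eh.insert(k)
```

Only the bucket that overflows is split, and the directory doubles only when that bucket already uses all d bits, so several directory entries may share one bucket. This indirection is what lets the buckets live at different sites.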
• Two Phase Hashing

A new hash algorithm for massively parallel systems is proposed by Yen and
Bastani [61]. In sequential systems, chaining gives the best performance, but
in massively parallel processors, this leads to a high communication cost. Linear

This algorithm, called two-phase hashing, combines the chaining and linear
probing concepts. Here, a hash table with in-table chaining is used: the hash
table keeps chains in the table itself instead of having other chain nodes. If
the number of elements hashed at each entry is known, then the final location
of each element can be computed. The first phase computes the number of
elements that are to be hashed at each entry. From this, the final location of
each element is computed. The next phase performs the real hashing where the
data are forwarded to the hash entry. From the hash entry the data are then
forwarded to the starting location of the chain. The chain is then searched. A
slight variation of the linear probing algorithm known as the hypercube hash
algorithm is also discussed. In this algorithm, the hash table is mapped directly
to the processor space (i.e., the ith entry is assigned to processor i). Collisions
are resolved by rehashing. The difference between the above two algorithms is
the method of computation of the rehashed location.
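
The two phases can be illustrated with a sequential analogue. This is a sketch under stated assumptions, not the massively parallel algorithm of [61]: the function names build_two_phase and lookup, the modulo hash on integer keys, and the layout of chains as contiguous runs of the table are all chosen for illustration. The first phase counts the elements per hash entry, prefix sums of the counts give the starting location of each chain inside the table, and the second phase places every element at its final location.

```python
def build_two_phase(keys, num_entries):
    """Build a table with in-table chaining in two passes over the keys."""
    # Phase 1: count how many keys hash to each entry.
    counts = [0] * num_entries
    for k in keys:
        counts[k % num_entries] += 1

    # Prefix sums: starts[e] is where the chain of entry e begins in the table.
    starts = [0] * (num_entries + 1)
    for e in range(num_entries):
        starts[e + 1] = starts[e] + counts[e]

    # Phase 2: forward each key to its entry, then to its slot in the chain.
    table = [None] * len(keys)
    cursor = starts[:-1]          # next free slot per entry (slice makes a copy)
    for k in keys:
        e = k % num_entries
        table[cursor[e]] = k
        cursor[e] += 1
    return table, starts


def lookup(table, starts, key):
    """Go to the hash entry, jump to the start of its chain, search the chain."""
    e = key % (len(starts) - 1)
    return key in table[starts[e]:starts[e + 1]]


keys = [7, 12, 3, 19, 8, 15, 4]
table, starts = build_two_phase(keys, num_entries=4)
```

Because every chain occupies a known contiguous range, no separate chain nodes are allocated, which is the property the in-table chaining description above relies on.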


• Trie Hashing

Trie hashing has been discussed by Litwin [40]. As in a normal hashing technique,
the records are stored in buckets. The bucket addresses are computed
with a dynamic trie of size proportional to the file size. The trie is a result of
splits caused by bucket overflows. The trie can be stored on the disk as subtries
for large files. Normally, because of the high branching factor, two levels
are sufficient to store a one gigabyte file; therefore, two accesses are sufficient.
The paper also proposes a method for the control of the bucket load factor of

Linear Hashing for distributed files has been proposed by Litwin et al. [41].
It is useful for creating large files where the distribution of objects is necessary
to exploit parallelism. It is suitable for creating scalable distributed data
structures (SDDS). The mechanism is called LH* and an LH* file can grow
to any size. A file is stored in a bucket at each server site. Since the bucket
itself could be a single disk file, it is possible to create extremely large scalable
files. Clients insert or retrieve objects from the file. Clients and servers are
the nodes of a network and can be extended to any number of sites. A structure
is termed SDDS if it can expand to new servers gracefully only when the
currently used ones are efficiently loaded. There should be no master site that
would make the system unreliable. Finally, file access primitives should not be
atomic actions. A simulation of the SDDS on a shared-nothing multiprocessor
showed that it takes one message (three in the worst case) per key insert and
two messages (four in the worst case) for retrieval. They also showed that the
average performance is close to optimal for both inserts and retrievals.

A family of order-preserving scalable distributed data structures, namely RP*,
has been proposed by Litwin et al. [42]. To support range-queries and ordered

crs. The fundamental algorithm builds the file with the same key space as a
B-tree but without the indices by using multicast. Two other algorithms enhance
throughput of the network by adding the indices on either the clients, or
Distributed file organization for disk resident files has been discussed by Vingralek
et al. [56]. The focus of their work has been to achieve scalability (in
terms of the number of servers) of the throughput and the file size while dynamically
distributing data. Their results indicate that scalability is achieved
at a controlled cost/performance.

Dictionary 

A dictionary is a dynamic data structure that supports the following operations:
insert, delete and search (lookup). It is one of the most fundamental data structures

systems, for implementing symbol tables and pattern matching systems. In a conventional
system the cost of an operation on the dictionary is a function of the number of elements.
Pipelining the operations will give more parallelism, but this leads to a bottleneck for
the more frequently accessed items. The bottleneck becomes severe as the number
of processors increases. In a single processor environment, dictionaries are usually
implemented as tree structures such as the AVL tree and the B+-tree. The response
of the sequential dictionary machines is a logarithmic function of the number of

A sequential dictionary machine has been proposed that allows simultaneous and
redundant accesses [15]. The objective of this system is to remove the sequential

data elements being stored at the leaf nodes. The accesses are sorted to form groups.
The data elements are also ordered and made into groups so that the interaction with
a group takes logarithmic time. The accesses within a group are sent to different
groups of data elements. The binary tree serves to distribute the accesses.

An implementation of a distributed dictionary is described by Dietzfelbinger in
[13]. The implementation is on a completely synchronized network of processors and
is based on hashing. The keys to be inserted, deleted and searched are distributed to
the processors via a hash function and processed using a dynamic hashing technique.

A distributed dictionary using B-link trees has been proposed [23]. This paper distributes
the nodes of the tree among the processors. The interior nodes are replicated
to improve parallelism and alleviate the bottleneck. The processor that owns a leaf
owns all the nodes on the path from the root to the leaf. Restructuring decisions are
made locally, thereby reducing the communication costs and increasing parallelism.
The paper also deals with the problem of data balancing across the processors.

Another highly concurrent dictionary for parallel shared memory has been described
by Parker [45]. This approach implements a dictionary independent of the
underlying architecture. A new data structure called a sibling trie is used to implement
the dictionary. Sibling tries, though based on trees, are a special kind of
graph. The graph should be strongly connected so that every node is reachable from
every other node. Multiple processes can search, insert, delete and update the data
without creating hot spots. The advantage of using the sibling trie is that the search
can start from any node, not necessarily the root, thereby reducing hot spots and
providing alternate routes to a data item. The trie, a binary tree that implements
a radix search, is the first component of the data structure and the second one is a
sibling graph which connects nodes at the same level. Parker uses links to increase
concurrency. The sibling graph is similar to the links used in a B-tree [49] and allows
fast sequential access. The links in the sibling graph are used to traverse the entire
structure, hence the diameter of the graph must be kept small. The paper presents

Peleg [46] has presented a detailed example of a distributed data structure. A
compact dictionary structure, called BIN, is described. The BIN is based on a ‘^at"

of  bins  (each  maintained  at  some  vertex)  that  store  the  data  in  an  orderly  fashion. 
The  paper  also  discusses  the  distribution  of  the  central  server  and  replication  issues. 
Complexity  issues  and  memory  balancing  are  also  addressed. 

Search structures based on the Linear Ordinary-Leaves Structures (LOLS) family
(such as K-D-B-trees, etc.) have been proposed [4J]. The paper addresses
the problem of designing search structures to fit shared memory multiprocessor multidisk
systems. The index of the structure is partitioned into a number of identical
sub-indices (the sub-indices have the same structure and contents) which are stored
in the shared memory while the data leaves that contain the data records are distributed
across the processors. The design goal is to decrease the main memory
consumption while having the same parallel processing capability, the same
time per operation and same disk utilization as other methods which use a single
index structure.

1.3  Contributions  of  this  Work 

This dissertation addresses several issues such as fully distributing a B-tree, location
independent naming of a node, data balancing and replic

among others.


Data partitioning also raises new issues such as allocating storage for the data, efficiency
of access and balancing data among processors. The factors of concern for
distributed storage are throughput, scalability and reliability. Most of these topics of

We have developed a theoretical framework for replicating the interior nodes of the
B-tree. Based on this, we have implemented two strategies of replication, namely
full-replication and path-replication. The performance of these algorithms shows that
path-replication is better and more scalable. We have developed several algorithms
for data balancing a distributed replicated B-tree. We present the performance of
our algorithms. An application of the work is the distributed extent tree (dE-tree).
We developed several data balancing algorithms for the distributed extent tree.

1.3.1 Structure of the Dissertation

We have organized this dissertation into two broad categories: theory and
practice. In Chapter 2, we provide background on concurrent B-trees, the distributed
B-tree and the distributed extent tree.

Chapter 3 provides the theoretical framework for replicating the nodes of a
B-tree. In Chapter 4, we present the implementation design details. We present the
underlying architecture and message passing mechanism for our implementation. We
also present some generalized protocols that are common to all our data balancing
algorithms. Finally, we discuss the portability of our implementation from the SUNs
to the KSR, a shared memory parallel machine.

The performance of our replication and data balancing algorithms is presented
in Chapter 5. Here, we discuss the replication strategies and the results on
their performance. We next discuss the various data balancing algorithms on the
dB-tree and compare their performance. We present the performance of the


CHAPTER  2 

SURVEY  OF  RELEVANT  WORK 


In this chapter, we present some background on concurrent B-trees, concurrent
B-link algorithms, the distributed B-tree, and data balancing the distributed B-tree.
We also provide a discussion of the paper by Johnson and Colbrook ([23]). They
introduce a new balanced search tree algorithm for distributed memory systems.
They use the B-link tree as a basis for the distributed B-tree, the dB-tree. To reduce
the cost of maintenance of the distributed B-tree, a path replication strategy is used,
wherein if a processor owns a leaf node then it also owns all the nodes from the root

initiated at any processor. The leaf level nodes are not replicated. The concept of
data balancing has also been introduced to balance the load at all processors. They
present some ideas on how data balancing can be implemented using distributed
B-link tree algorithms. Finally, they also show how the dB-tree algorithms can be used
to build a data-balanced distributed dictionary, the dE-tree.


Tree structures (in particular B-trees) are suitable for creating indices. B-trees
of high order are desirable since they result in a reduction of the number of disk
accesses needed to search an index. If the index has N entries, then a B-tree of order
m = N + 1 would have only one level. An insertion which makes a node
too full splits the node, and a restructuring of the tree is performed.


for concurrency of several processes. The original B-tree algorithms were designed
for sequential applications, where only one process accessed and manipulated the
B-tree. The main concern of these algorithms was minimizing access latency. However,
with the growth of processing power and the need for parallel computing, maximizing
throughput has become important. The B-tree is suitable for concurrent operations
by allowing individual processes to perform independent operations.

Several approaches to concurrent access of the B-tree have been proposed [7],
[37], [44], [32]. All the algorithms share the problem of contention, which can be
categorized into two types: data contention and resource contention. Both lead to
performance degradation.

• Data contention: All concurrent search tree algorithms require a concurrency
control technique to keep two or more processes which access the B-tree from
interfering with one another. This contention is more pronounced at the higher
levels of the tree. All algorithms proposed use some form of locking technique
to ensure exclusive access to a node.

• Resource contention: Performance degradation is inevitable when several
processes access a single resource in the system. In shared memory, this scenario
occurs when more than one process contends for the same memory location; in
a distributed architecture, contention occurs when one processor receives
messages requesting access to a node from every other processor. Sagiv [49], and
Lehman and Yao [37] use a link technique to reduce contention.

Parallel B-trees using multi-version memory have been proposed by Wang and
Weihl [60]. The algorithm is designed for software cache management and is suitable
for

Multi-version memory allows a process to read an "old version" of data. Therefore,
individual reads and writes no longer appear atomic. A multi-version memory thus
allows a data read to progress concurrently with a data write. Also, "cache misses"
are eliminated, since

not invalidate messages from replicated copies.

Multi-disk B-trees have been proposed by Seeger and Larson [53]. They propose
three different strategies for distributing the data stored on a B-tree over multiple
disks: record distribution, large page B-trees and page distribution. Local and global
load balancing is also addressed. The main focus of the paper is the throughput of
the system. Local load balancing is found to significantly reduce the response time
for range queries.


A B-tree of order m is a tree that satisfies the following conditions:

1. Every node has no more than m children.

2. The root has at least two children and every other internal node has at least
m/2 children.

3. The distance from the root to any leaf is the same.

A search for a key progresses recursively down from the root node. If the root
node holds the key, the search stops; otherwise, it continues downward. An insert
operation results in an insertion if the key is not already in the B-tree. If the node
is full (i.e., an insertion would cause it to contain m+1 keys), the node splits and
transfers half its keys (m/2) to the new sibling, and a pointer to the sibling is placed
in the parent. If the insertion causes the parent to split, the split moves upward
recursively. A delete searches for the key and removes it from the leaf node when
found. If the node has fewer than m/2 keys, it is merged with either sibling. This
technique is known as the merge-at-half technique. A better option is free-at-empty:
delete nodes only when they are empty.
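
The three defining conditions of a B-tree of order m can be checked mechanically.
The following is a minimal sketch, not part of the original algorithms: the `Node`
class and the `is_btree` function are hypothetical names, and the lower bound m/2
is assumed to mean the ceiling of m/2.

```python
from dataclasses import dataclass, field
from math import ceil
from typing import List


@dataclass
class Node:
    keys: List[int]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def is_btree(root: Node, m: int) -> bool:
    """Return True if the tree rooted at `root` satisfies the three conditions."""
    leaf_depths = set()

    def visit(node: Node, depth: int, is_root: bool) -> bool:
        if not node.children:                 # leaf: remember how deep it is
            leaf_depths.add(depth)
            return True
        n = len(node.children)
        if n > m:                             # condition 1: at most m children
            return False
        lower = 2 if is_root else ceil(m / 2)  # assumption: m/2 rounded up
        if n < lower:                         # condition 2: minimum fan-out
            return False
        return all(visit(c, depth + 1, False) for c in node.children)

    # condition 3: every leaf lies at the same distance from the root
    return visit(root, 0, True) and len(leaf_depths) == 1
```

For example, a root with two leaf children passes the check for m = 3, while a
tree whose leaves lie at different depths fails it.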

A variant of the B-tree known as the B+-tree stores the data only at the leaf
nodes. This structure is much easier to implement than a B-tree. A B-link-tree is a
B+-tree in which every node has a pointer to its right sibling at the same level. The
link provides a means of reaching a node when a split has occurred, thereby helping
the node to recover from misnavigated operations. The B-link-tree algorithms have
been found to have the highest performance of all concurrent B-tree algorithms [22].
In the concurrent B-link-tree proposed by Sagiv [49], every node has a field that is
the highest-valued key stored in the subtree.

A search operation starts at the root node and proceeds downwards. In this
algorithm at most one node is locked at any time. A search first places an R
(read) lock on the root, then finds the correct child to follow. Next, the root node
is unlocked and the child is R locked. Having reached a leaf node, the search finds
the correct leaf node (i.e., the one whose highest value is greater than or equal to the
key being searched for) by traversing the right links in a node. The search returns
success or failure depending on whether the key is present in the leaf node (Figure 2.1).
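
The search just described can be sketched as follows. This is an illustrative,
single-threaded sketch only: locking is omitted, the names `BLinkNode`,
`move_right` and `search` are hypothetical, and the right-link chase is applied
at every level rather than only at the leaves, which is an assumption beyond the
description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BLinkNode:
    keys: List[int]                      # sorted; separators in interior nodes
    high: float                          # highest key this node may hold
    children: List["BLinkNode"] = field(default_factory=list)  # empty for a leaf
    right: Optional["BLinkNode"] = None  # right sibling on the same level


def move_right(node: BLinkNode, key: int) -> BLinkNode:
    """Follow right links until `key` falls inside the node's key range."""
    while key > node.high and node.right is not None:
        node = node.right
    return node


def search(root: BLinkNode, key: int) -> bool:
    node = move_right(root, key)
    while node.children:
        # child i covers keys <= keys[i]; the last child covers the rest
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        node = move_right(node.children[i], key)
    return key in node.keys
```

If a leaf has split but its parent has not yet learned of the new sibling, the
search still succeeds: it arrives at the old leaf, sees that the key exceeds the
leaf's highest value, and follows the right link.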

An insert operation works in two phases: a search phase and a restructuring
phase (23). The difference between the search phase of an insert operation and the
search operation described above is that here the R lock on the leaf nodes is replaced
by a W (exclusive write) lock. The key is inserted, if not already present, in the
appropriate leaf. If the insert causes a leaf node to become too full, a split occurs
and the restructuring begins as in the usual B-tree algorithm. Since the operations
hold at most one lock at a time, restructuring must be separated into disjoint
operations. The first phase is to perform a half-split operation (Figure 2.2). During
this phase, a new node, the sibling, is created and half the keys from the original node
are transferred into it. The sibling is put into the leaf list and the sibling pointers
are adjusted appropriately. The next phase is to inform the parent of the split. Now
the lock on the leaf node is released, the parent node is locked, and a pointer to the
sibling is inserted into the parent. During the time between the split and the insertion
of the pointer into the parent, operations navigate to the sibling via the link
and the highest fields in the node.
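
The two restructuring phases can be illustrated with a small sketch. It is a
simplification under stated assumptions: locks are omitted, the parent is
modelled as a list of (recorded highest key, leaf) pairs, and the names `Leaf`,
`half_split`, `complete_split` and `find_leaf` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Leaf:
    keys: List[int]
    high: float                      # highest key this leaf may hold
    right: Optional["Leaf"] = None
    left: Optional["Leaf"] = None


Parent = List[Tuple[float, Leaf]]    # (highest key recorded in parent, leaf)


def half_split(leaf: Leaf) -> Leaf:
    """Phase 1: move the upper half of the keys into a new right sibling."""
    mid = len(leaf.keys) // 2
    sibling = Leaf(keys=leaf.keys[mid:], high=leaf.high,
                   right=leaf.right, left=leaf)
    if leaf.right is not None:
        leaf.right.left = sibling    # adjust the old neighbor's back pointer
    leaf.keys = leaf.keys[:mid]
    leaf.high = leaf.keys[-1]
    leaf.right = sibling
    return sibling


def complete_split(parent: Parent, leaf: Leaf, sibling: Leaf) -> None:
    """Phase 2: tell the parent about the sibling created by the half-split."""
    i = next(j for j, (_, l) in enumerate(parent) if l is leaf)
    parent[i] = (leaf.high, leaf)
    parent.insert(i + 1, (sibling.high, sibling))


def find_leaf(parent: Parent, key: int) -> Leaf:
    """Route through the parent, then follow right links if it is out of date."""
    leaf = next(l for bound, l in parent if key <= bound)
    while key > leaf.high and leaf.right is not None:
        leaf = leaf.right
    return leaf
```

Between the two phases the parent still routes every key to the original leaf,
and `find_leaf` reaches the sibling through the right link, which is why the two
phases can be performed as separate, independently locked steps.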

On-the-fly node deletion is not supported in shared-memory multiprocessors.
Several alternatives to on-the-fly deletion exist, including never deleting nodes,
performing garbage collection or leaving the deleted nodes as stubs without deallocating them
physically.


2.3 The dB-tree


Johnson and Colbrook [23] present a distributed B-tree suitable for message
passing architectures. The interior nodes are replicated to improve parallelism and
alleviate the bottleneck. The processor that owns a leaf owns all the nodes on the path
from the root to the leaf. Restructuring decisions are made locally, thereby reducing
the communication overhead and increasing parallelism. The paper also deals with
data balancing among processors.

The dB-tree is built upon the concurrent B-link algorithms. In the dB-tree, the
leaves are distributed among processors. The interior nodes are replicated among
the processors. Every processor on a level has links to both its neighbors. Also,
each node stores the distance from the leaves. Nodes of the dB-tree are given unique
tags. A processor increments a node counter on the creation of a node. The tag
is a concatenation of a node counter at a processor and the processor number. A

The operations insert, delete and search are defined on the dB-tree.
Corresponding to each operation, actions are performed on the nodes of the tree. A processor
accepts messages from other processors for performing the operations. Misnavigated
messages are routed to the correct processor. When a node becomes full, it
"half-splits". The double links of a node help in performing the half-split. Similarly, when
a node merges into another node or becomes empty, it must be deleted from the tree.
A half-merge procedure is used. All links to a merged node must be changed before
a merged node can actually be deleted from the tree.

2.4 Replication

The multi-version memory algorithm proposed by Wang and Weihl [60] reduces
the amount of synchronization and communication needed to maintain replicated
copies, thus reducing the effect of resource contention. Several algorithms have been
proposed for replicating a node [8]. Lazy replication has been proposed by Ladin et
al. for replicating servers [34]. The servers appear to be logically centralized, in spite
of their physical distribution. Replicas communicate information among themselves
by lazily exchanging gossip messages. Johnson and Krishna [26] have proposed
fixed-copy and variable-copy algorithms for lazy updates on a distributed B-tree.


Replica coherency can be achieved by locking every copy of the node that is to be
modified and blocking all reads and updates on the node. However, this is too
restrictive. Johnson and Colbrook maintain replica coherency with far less
synchronization and overhead. Only the modification is distributed to the copies,
not the entire node contents. A node is never in an incorrect state, hence reading
need not "block". Also, most modifying actions commute, so the order in which they
are performed does not matter. In Section 3.2, we will see how two pending inserts at
a parent can be performed in any order at the various copies of the parent.
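
That pending inserts commute can be seen in a small sketch. It is only an
illustration under assumptions: a node copy is modelled as a sorted list of
(separator key, child tag) pairs, node capacity is ignored so that no split is
triggered, and the names `apply_insert` and `replay` and the tags are
hypothetical.

```python
import bisect
from itertools import permutations
from typing import List, Sequence, Tuple

Entry = Tuple[int, str]   # (separator key, tag of the child node)


def apply_insert(entries: List[Entry], key: int, child: str) -> None:
    """Insert one entry into a node copy, keeping the copy sorted."""
    item = (key, child)
    pos = bisect.bisect_left(entries, item)
    if pos == len(entries) or entries[pos] != item:  # ignore a repeated delivery
        entries.insert(pos, item)


def replay(initial: Sequence[Entry], updates: Sequence[Entry]) -> List[Entry]:
    """Apply `updates` in the given order to a fresh copy of the node."""
    copy = list(initial)
    for key, child in updates:
        apply_insert(copy, key, child)
    return copy


initial = [(10, "n1"), (50, "n2")]
pending = [(30, "n3"), (70, "n4")]
# Each copy of the parent may see the pending inserts in a different order.
results = {tuple(replay(initial, order)) for order in permutations(pending)}
print(len(results))  # prints 1: every order yields the same node contents
```

Because every delivery order leaves the copies identical, the inserts need no
mutual synchronization.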

However, not all actions on a node can be performed in an arbitrary order. If
an insert and a delete are pending on two copies of a full node, the order matters:
the insert performed first leads to a split in the node in one copy, while no split
occurs in the other copy. The problem is the ordering of the split with the insert or
delete. Johnson et al. present correctness criteria for the data structure.

They categorize actions on nodes as being lazy, semi-synchronous, or synchronous
according to the amount of synchronization required to perform the action. A lazy
action does not need to synchronize with other lazy actions. A semi-synchronous
action must synchronize with some, but not all, other actions. A synchronous action
is that which must be ordered with all other actions, or that requires communication

Johnson and Krishna [26] present a framework for creating and analyzing lazy
update algorithms. The framework is used to develop algorithms that can manage
a dB-tree node. The algorithm uses lazy insert actions and semi-synchronous
half-split actions. In addition, the algorithm framework accounts for ordered actions to
require that classes of actions are performed on a node in the order in which they are
generated (i.e., the link-change actions are ordered).


perform data balancing on the processors. The balancing also spreads the queries
to the data structure evenly among processors. It also provides equal memory and
space utilization at each processor.

Data balancing among processors has been studied by Johnson and Colbrook [23].
They suggest a way of reducing communication cost for data balancing by storing
neighboring leaves on the same processor. When a processor decides that it has too
many leaves, it looks at a processor holding adjacent leaves. If that processor accepts
the leaves, the excess leaves are transferred. If no neighboring processor is lightly
loaded, the heavily loaded processor looks for a lightly loaded processor and transfers
the leaves.

In the context of node mobility, object mobility has been proposed in Emerald
[29]. Objects keep forwarding information even after they have moved to another
node and use a broadcast protocol if no forwarding information is available.


Lee et al. [36] have discussed a fault-tolerant scheme for distributing queues. The
scheme described by them provides dynamic fault tolerance, high availability and uni-

High availability is achieved by replication of the queue, and each queue replica may
be distributed over several sites. Consistency is maintained by two-phase locking.

tion overhead is low. However, every queue access requires communication to ensure
global consistency. When a processor issues a queue operation, it sends a request
to the processor containing the head or tail of the queue. On receiving the request,
the current head or tail processor will lock up all other head or tail queue replicas,
thereby ensuring consistency. If the processor which receives a request does not hold
the head or tail, it forwards the request. The chasing continues until the processor
holding the head or tail is found.

Ellis' algorithm [14] performs data-balancing whenever a processor runs out of
storage. Peleg [46] has studied the issue of data-balancing in distributed dictionaries
from a complexity point of view, requiring that no processor store more than O(M/N)
keys, where M is the number of keys and N is the number of processors. In practice,


and node capacities.

In the dB-tree, the data balancing is performed by distributing the leaves among
the processors. This requires communication among the processors each time a leaf
moves, in order to update sibling and parent links. Also, the number of interior nodes
replicated is high. An alternative to the dB-tree is the dE-tree.

2.6 dE-tree

To reduce the communication cost, Johnson and Colbrook suggest the dE-tree,
also known as the distributed extent tree, where neighboring leaves are stored on the
same processor. They define an extent to be a maximal length sequence of neighboring
leaves that are owned by the same processor. When a processor decides that it owns
too many leaves, it first looks at the processors that own neighboring extents. If
the neighbor will accept the leaves, the processor transfers some of its leaves to the
neighbor. If no neighboring processor is lightly loaded, the heavily loaded processor
searches for a lightly loaded processor and creates a new extent.
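
The balancing decision described above might be sketched as follows. The
thresholds `high` and `low`, the function name `choose_receiver`, and the rule of
picking the least loaded candidate are assumptions made for illustration; the
text does not specify them.

```python
from typing import Dict, List, Optional, Tuple


def choose_receiver(me: int,
                    leaves: Dict[int, int],
                    neighbors: List[int],
                    high: int,
                    low: int) -> Optional[Tuple[int, str]]:
    """Decide where an overloaded processor should send some of its leaves.

    leaves    -- number of leaves currently owned by each processor
    neighbors -- processors that own extents adjacent to one of mine
    high, low -- hypothetical load thresholds (heavily / lightly loaded)
    """
    if leaves[me] <= high:
        return None                                   # not overloaded
    willing = [p for p in neighbors if leaves[p] < low]
    if willing:
        # shift leaves into the adjacent extent of the least loaded neighbor
        return min(willing, key=lambda p: leaves[p]), "extend-neighbor-extent"
    others = [p for p in leaves
              if p != me and p not in neighbors and leaves[p] < low]
    if others:
        # no neighbor can help: start a new extent on a lightly loaded processor
        return min(others, key=lambda p: leaves[p]), "create-new-extent"
    return None                                       # everyone is busy
```

Preferring a neighbor keeps the number of extents, and therefore the size of the
index, from growing unless it has to.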

Figure 3.3 shows a four processor dB-tree that is data balanced using the extents.
The extents have the characteristics of a leaf in the dB-tree: they have an upper and
lower range, are doubly linked, accept the dictionary operations, and are occasionally
split or merged. The extent-balanced dB-tree can be treated as a dE-tree. Each
processor manages a number of extents. The keys stored in the extent are kept in
some convenient data structure. Each extent is linked with its neighboring extent.

The extents are managed as the leaves in a dB-tree. When a processor decides
that it is too heavily loaded, it first looks at the neighboring extents to take some
of its keys. If all neighboring processors are heavily loaded, a new extent is created
for a lightly loaded processor. The creation and deletion of extents, and the shifting
of keys between extents in the dE-tree correspond to splitting and merging leaves in
the dB-tree, and the index can be updated by using dB-tree algorithms.

Since a processor can store many keys, the index size is proportional to the number
of processors. Also, index restructuring is greatly reduced, as it takes place only after
a large number of keys have been inserted or deleted.

The dE-tree can be used to maintain striped file systems ([27]).

2.6.1 Striped File Systems

Parallel file systems have been proposed to better match I/O throughput to
processing power. A parallel file system is a file system in which the files are stored on
multiple disks and the disk drives are located on different processors. A common
method for implementing a parallel file system is to use disk striping [51], in which
consecutive blocks in a file are stored on different disk drives; each disk has its own
controller. A parallel striped file system, Bridge, has been implemented on the BBN
Butterfly [33]. A striped file can be appended (or prepended) to and maintain its
structure. However, a block can't be inserted into or deleted from the middle of the
file, since doing so would destroy the regular striping structure of the file because
of an out-of-order block or a gap. A reorganization of the file is required. Bridge,
however, does not support these operations. In many applications, the most common
operations on the file are "read" and "append", so striping reduces latency. Certain
other applications use "inserts" and "deletes" from the middle of the file.



In the indexed striped file proposed by Johnson [27], the file consists of a single
extent initially. An insert or delete calls for decisions to be made about reorganizing
the extents. The inserts in the extent that cause a split correspond to the splitting
of a node in our B-tree, and joining of two extents corresponds to a merging of the
nodes. Thus, the dB-tree algorithms can provide an index structure which allows
one to insert into or delete from a striped file, and further, as the striped extents
are linked together, the file can be sequentially read in a highly parallel manner to
provide fast random access. Direct access to the file is also fast.

The assumption is that the file is composed of records, each of which can be
identified by a key which in turn can be ordered. This assumption is reasonable,
because "insert this data after the 100-th block in the file" loses
meaning when data blocks are being inserted and deleted concurrently.

The dE-tree is appropriate for a file index structure which allows insertions and
deletions in the middle of a parallel striped file (if the records in the file are ordered),
and that permits fast random access and highly parallel block reads. Instead of
maintaining a single striped file, a sequence of independently striped extents is maintained,
i.e., a striped file is broken into extents, and an index into the extents is kept. The
idea is that on an insertion or a deletion, either the extent can be reorganized or a
new extent created. The dB-tree index helps to manage the striped extents.

An example of an indexed striped file is shown in Figure 2.4. The file is broken
into a number of extents, each of which is independently striped across M disks (i.e., a
striped extent). The extents are indexed by a dB-tree. The index is used for managing
the extents, as well as for providing an index for random access.
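
A lookup in an indexed striped file can be sketched as below. This is a toy model
under several assumptions: each block holds one record, blocks within an extent
are placed round-robin over the M disks, and a sorted list searched with `bisect`
stands in for the dB-tree index. The names `StripedExtent` and `ExtentIndex` are
hypothetical.

```python
import bisect
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StripedExtent:
    keys: List[int]     # record keys in key order, one record per block
    num_disks: int      # M: number of disks this extent is striped across

    def locate(self, key: int) -> Tuple[int, int]:
        """Return (disk, block offset on that disk) for the record `key`."""
        block = self.keys.index(key)          # raises ValueError if absent
        return block % self.num_disks, block // self.num_disks


class ExtentIndex:
    """Stand-in for the dB-tree index: maps a key to the extent covering it."""

    def __init__(self, extents: List[StripedExtent]) -> None:
        self.extents = sorted(extents, key=lambda e: e.keys[0])
        self.bounds = [e.keys[0] for e in self.extents]  # lowest key per extent

    def find_extent(self, key: int) -> StripedExtent:
        i = bisect.bisect_right(self.bounds, key) - 1
        if i < 0:
            raise KeyError(key)
        return self.extents[i]


index = ExtentIndex([StripedExtent([10, 20, 30, 40, 50], 3),
                     StripedExtent([100, 110, 120], 3)])
disk, offset = index.find_extent(40).locate(40)
```

Because each extent is striped independently, inserting a record only disturbs
the block layout of the one extent that receives it.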


reference to the index


2.7 Conclusion

In this chapter, we have presented a background on concurrent B-trees and the
distributed B-tree. We have discussed the work done by Johnson and Colbrook
([23]). They present some ideas on the implementation of a distributed B-tree and
also present some techniques to avoid the root bottleneck by replication of the interior
nodes. Further, they discuss some ideas on data-balancing the processors which hold
the distributed B-tree. Concurrency control and replica coherency are also addressed.
To reduce the cost of communication for data-balancing, they suggest the distributed
extent tree. The dE-tree manages extents instead of individual keys. Johnson ([27])
provides a discussion of how the dE-tree can be used for a practical application of
striped file systems. In the next chapter, we provide a theoretical framework of the
algorithms for replication of the distributed B-tree.


CHAPTER 3

REPLICATION  ALGORITHMS 
3.1  Introduction 

When addressing large volumes of data, there is a danger of memory bottlenecks,
where all processors access the same data item stored at one processor. For example,
one of the problems with a distributed search structure is that since all accesses
to the data have to pass via the root node, the root node becomes a bottleneck and
overwhelms the processor which stores it (as noted in [6]). It also creates excessive message
traffic in the network towards the processor which holds the root node of the search
structure. This is known as resource contention and can be solved by replication.
Allowing multiple copies of often accessed nodes distributes the work load among
the components of the system. Replication, while providing redundancy and availability
and improving concurrency, introduces consistency problems previously not
present. A method of achieving consistency is to guarantee that all operations take
place in the same order at all the sites of the distributed system.

Several algorithms have been proposed for replicating a node [8]. These, however,
do so at the cost of concurrency, since they require synchronization and thus create
significant communication overhead. Lazy replication has been proposed by Ladin
and Liskov for replicating servers [34]. The servers appear to be logically centralized,
in spite of their physical distribution. Replicas communicate information among
themselves by lazily exchanging gossip messages. This, however, creates the following
problem. Consider two different operations, a and b, that are causally related but
executing at different replicas, A and B. If operation b is dependent on the previous
one, a, the replica which receives b, i.e., B, does not have enough information about
it to proceed. The replica, B, has to delay the operation of b until it receives all the

Techniques exist to reduce the cost of maintaining replicated data and for
increasing concurrency. Ladin, Liskov, and Shrira propose lazy replication for maintaining
replicated servers [34]. Lazy replication uses the dependencies that exist in the
operations to determine if a server's data is sufficiently up-to-date to execute a new request.
Several authors have explored the construction of non-blocking and wait-free
concurrent data structures in a shared-memory environment [17]. These algorithms enhance
concurrency because a slow operation never blocks a fast operation.

In this chapter, we present an approach to maintaining distributed data
structures which uses lazy updates, which take advantage of the semantics of the search
structure operations to allow for scalable and low-overhead replication. Lazy
updates can be used to design distributed search structures that support very high
levels of concurrency. The alternatives to lazy update algorithms (vigorous updates)
use synchronization to ensure consistency.

Lazy update algorithms are similar to lazy replication algorithms because both
use the semantics of an operation to reduce the cost of maintaining replicated copies.
The effects of an operation can be lazily sent to the other servers, perhaps on
piggybacked messages. The lazy replication algorithm blocks an operation until the local
data is sufficiently up-to-date. In contrast, a non-blocking wait-free concurrent data
structure never blocks an operation. The lazy update algorithms are similar in that
the execution of a remote operation never blocks a local operation; hence, they are a
distributed analogue of non-blocking algorithms.

Lazy updates have a number of pragmatic advantages over more vigorous
replication algorithms. They significantly reduce maintenance overhead. They are highly
concurrent, since they permit concurrent reads, reads concurrent with updates, and
concurrent updates (at different nodes). Since lazy updates avoid the use of
synchronization, they are much easier to implement than vigorous update algorithms.

Despite the benefits of the lazy update approach, implementors might be reluctant
to use it without correctness guarantees. We develop a correctness theory for lazy
updates so that our algorithms can be applied to other distributed search structures.
We demonstrate the application of lazy updates to the dB-tree, which is a distributed
B-tree which replicates its interior nodes for highly parallel access [23].

We present three algorithms, the last of which can implement a dB-tree which
never merges nodes and performs data balancing on leaf nodes (we have previously
found that never merging nodes results in little loss in space utilization [21], and
data balancing on the leaf level is low-overhead and effective [30]). The methods we
present can be applied to other distributed search structures, such as hash tables [14],
Before we conclude this introduction, we should mention some useful
characteristics of lazy updates. First, when a lazy update is performed at one copy of a node,
it must also be performed at the other copies. Since the lazy update commutes with
other updates, there is no pressing need to inform the other copies of the update
immediately. Instead, the lazy update can be piggybacked onto messages used for other
purposes, greatly reducing the cost of replication management (this is similar to the
lazy replication techniques [34]). Second, index node searches and updates commute,
so that one copy of a node may be read while another copy is being updated. Further,
two updates to the copies of a node may proceed at the same time. As a result, the
dB-tree not only supports concurrent read actions on different copies of its nodes, it
supports concurrent reads and updates, as well as concurrent updates.


Figure 3.1. A dB-tree


3.2 Replication

All operations start by accessing the root of the search structure. If there is only
one copy of the root, then access to the index is serialized. Therefore, we want to
replicate the root widely in order to improve parallelism. As we increase the degree
of replication, however, the cost of maintaining coherent copies of a node increases.
Since the root is rarely updated, maintaining coherence at the root isn’t a problem.
A leaf is rarely accessed, but a significant portion of the accesses are updates. As a
result, wide replication of leaf nodes is prohibitively expensive.

In the dB-tree the leaf nodes are stored on a single processor. We apply the rule
that if a processor stores a leaf node, it stores every node on the path from the root
to that leaf. An example of a dB-tree which uses this replication policy is shown
in Figure 3.1. The dB-tree replication policy stores the root everywhere, the leaves
at a single processor, and the intermediate nodes at a moderate level of replication.
As a result, an operation can be initiated at every processor simultaneously, but the
effects of updates are localized. As a side effect, an operation can perform much of
its searching locally, reducing the number of messages passed.
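The replication rule can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the tree is given as a child-to-parent map, each leaf has exactly one owning processor, and the names `replicate_paths`, `parent`, and `leaf_owner` are hypothetical rather than taken from the dB-tree implementation.

```python
from collections import defaultdict

def replicate_paths(parent, leaf_owner):
    """Return {node: set of processors} under the path-replication rule:
    a processor that stores a leaf also stores every ancestor of that leaf."""
    copies = defaultdict(set)
    for leaf, proc in leaf_owner.items():
        node = leaf
        while node is not None:
            copies[node].add(proc)
            node = parent.get(node)  # becomes None once we step past the root
    return dict(copies)

# Hypothetical three-level tree: root -> {i1, i2}, i1 -> {l1, l2}, i2 -> {l3, l4}.
parent = {"i1": "root", "i2": "root",
          "l1": "i1", "l2": "i1", "l3": "i2", "l4": "i2"}
leaf_owner = {"l1": "P1", "l2": "P2", "l3": "P3", "l4": "P4"}

copies = replicate_paths(parent, leaf_owner)
```

In this example the root ends up on all four processors, each interior node on two, and each leaf on one, which matches the "root everywhere, leaves at a single processor" description.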

The replication strategy for a dB-tree helps to reduce the cost of maintaining a
distributed search structure, but the replication strategy alone is not enough. If every
node update required the execution of an available-copies algorithm [8], the overhead
of maintaining replicated copies would be prohibitive. Instead, we take advantage of
the semantics of the actions on the search structure nodes and use lazy updates to
maintain the replicated copies inexpensively.

Figure 3.2. Lazy inserts

We note that many of the actions on a dB-tree node commute. For example,
consider the sequence of actions which occurs in Figure 3.2. Suppose that nodes A
and B split at “about the same time.” Pointers to the new siblings must be inserted
into the parent, of which there are two copies. A pointer to A’ is inserted into the
first copy of the parent and a pointer to B’ is inserted into the second copy of the
parent. At this point, the search structure is inconsistent, since not only does the
parent not contain a pointer to one of its children, but the two copies of the parent
don’t contain the same value.

The tree in Figure 3.2 is still usable, since no node has been made unavailable.
Further, the copies of the parents will eventually converge to the same value. Therefore,
there is no need for one insert action on a node to synchronize with another
insert action on a node. The tree is always navigable, so the execution of an insert
doesn't block a search action. We call node actions with such loose synchronization
requirements lazy updates.
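The scenario of Figure 3.2 can be replayed in a few lines of Python. This is a minimal sketch, assuming that a copy of the parent can be modelled as a dictionary from separator keys to child pointers; the keys and the helper `insert_pointer` are made up for illustration.

```python
def insert_pointer(copy, key, child):
    """Insert a (separator key -> child) pointer into one copy of an index node.
    Inserting into a dict is insensitive to the order of distinct keys."""
    copy[key] = child

# Two copies of the same parent node, initially identical.
copy1 = {10: "A", 50: "B"}
copy2 = {10: "A", 50: "B"}

# A and B split at about the same time; each initial insert reaches a different copy.
insert_pointer(copy1, 30, "A'")   # initial insert at the first copy
insert_pointer(copy2, 70, "B'")   # initial insert at the second copy
diverged = copy1 != copy2         # the copies disagree for a while

# The relayed inserts arrive later, in the opposite order at each copy.
insert_pointer(copy1, 70, "B'")
insert_pointer(copy2, 30, "A'")
converged = copy1 == copy2        # both copies now hold the same value
```

The copies are temporarily inconsistent, yet neither insert had to wait for the other, and both copies end with the same four pointers.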


Shasha and Goodman [56] provide a framework for proving the correctness of non-replicated
concurrent data structures. We make extensive use of their framework in
order to discuss operation correctness. We delete most details here to save space, but
we note that if the distributed analogue of a link-type search structure algorithm follows
the Shasha-Goodman link algorithm guidelines, it will produce strict serializable
(or linearizable) executions. However, we would like the distributed search structure
to satisfy additional correctness constraints. For example, when a distributed computation
terminates, every copy of a node should have the same value. Performing
concurrency control on the copies is discussed in the following sections.

We intuitively want the replicated nodes of the distributed search structure to
contain the same value eventually. We can ensure the coherence of the copies by
serializing the actions on the nodes (perhaps via an “available-copies” algorithm [8]).
However, we want to be lazy about the maintenance. In this section, we describe a
model of distributed search structure computation and establish correctness criteria
for lazy updates.

A node of the logical search structure might be stored at several different processors.
We say that the physically stored replicas of the logical node are copies of the
logical node. We denote by $\mathit{copies}_t(n)$ the set of copies that correspond to node $n$ at
(global snapshot) time $t$.

An operation is performed by executing a sequence of actions on the copies of
the nodes of the search structure. Thus, the specification of an action on a copy has
two components: a final value $c'$ and a subsequent action set $SA$. An action that
modifies a node (an update action) is performed on one of the copies first, then is
relayed to the other copies; we distinguish the initial action from the relayed actions.
Thus, the specification of an action is:

$$a^t(p, c) = (c', SA)$$

Each subsequent action in $SA$ is of the form $(a_i, p_i, c_i)$, indicating that action $a_i$
with parameter $p_i$ should be performed on copy $c_i$. If copy $c_i$ is stored locally, the
processor puts the action in the set of executable actions. If $c_i$ is stored remotely, then
the action is sent to the processor which stores $c_i$. If the action is a return value action,
a message containing the return value is sent to the processor that initiated the operation.
If the final value of $a(p, c)$ is $c$ for every valid $p$ and $c$, then $a$ is a non-update action;
otherwise, $a$ is an update action. The superscript $t$ is either $i$ or $r$, indicating an initial
or a relayed action. We also distinguish initial actions by writing them in capitals, and
relayed actions by writing them in lowercase (e.g., $I$ and $i$ for an insert).

In order to discuss the commutativity of actions, we will need to specify whether
the order of two actions can be exchanged. If action $a^t$ with parameter $p$ can be performed
on $c$ to produce subsequent action set $SA$, then the action is valid; otherwise
the action is invalid. We note that the validity of an action does not depend on the
final value.
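The action model can be made concrete with a small Python sketch. It assumes, purely for illustration, that the value of a copy is a frozen set of keys and that a subsequent action is a triple (action name, parameter, target copy). The function names, the `"initiator"` and `"other_copy"` targets, and the sampled `is_update` check are hypothetical choices, not part of the original specification.

```python
def search(p, c):
    """Non-update action: the final value is always the unchanged copy c."""
    return c, [("return_value", p in c, "initiator")]

def INSERT(p, c):
    """Initial insert I: updates c and schedules a relayed insert i at the other copy."""
    return c | {p}, [("insert", p, "other_copy")]

def insert(p, c):
    """Relayed insert i: updates c and issues no subsequent actions."""
    return c | {p}, []

def is_update(action, params, values):
    """Sampled check of the definition: an action counts as a non-update action
    only if it leaves c unchanged for every parameter and copy value tried."""
    return any(action(p, c)[0] != c for p in params for c in values)

values = [frozenset(), frozenset({1, 2})]
params = [1, 3]

search_is_update = is_update(search, params, values)
insert_is_update = is_update(INSERT, params, values)
final_value, subsequent = INSERT(3, frozenset({1}))
```

Following the convention in the text, the initial action is written in capitals (`INSERT`) and the relayed action in lowercase (`insert`).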

An algorithm might require that some actions must be performed on all copies
of a node, or on all copies of several nodes “simultaneously.” Thus, we group some
action sequences into atomic action sequences, or AAS. The execution of an AAS at
a copy is initiated by an AAS_start action and terminated by an AAS_finish action.
A copy may run one or more AAS simultaneously. An AAS will commute with some
actions (possibly other AAS start actions), and conflict with others. We assume that
the node manager at each processor is aware of the AAS-action conflict relationships,
and will block actions that conflict with currently executing AAS. The AAS is the
distributed analogue of the shared memory lock, and can be used to implement a
similar kind of synchronization. However, lazy updates are preferable.

3.4.1  Histories 

In order to capture the conditions under which actions on a copy commute, we
model the value of a copy by its history (as in [18]). Formally, the total history of
copy $c \in \mathit{copies}_t(n)$ consists of the pair $(I_c, A^*_c)$, where $I_c$ is the initial value of $c$
and $A^*_c$ is a totally-ordered set of actions on $c$. We define correctness in terms of the
update actions, since non-update actions should not be required to execute at every
copy. The (update) history of a copy is a pair $(I_c, A_c)$, where $I_c$ is the same initial
value as in the total history, and $A_c$ is $A^*_c$ with the non-update actions deleted (and
the order on the update actions preserved). To remove the distinction between initial
and relayed actions, we define the uniform history, $U(H)$, to be the update history $H$
with each action $a^t$ replaced by $a$. Finally, we will write the history of copy $c$, $(I_c, A_c)$,
as $H_c = I_c \prod_{i=1}^{m} a_i$, where $A_c = (a_1, a_2, \ldots, a_m)$.

Suppose that $H_c = I_c \prod_{i=1}^{m} a_i$, and that $I_c$ is the final value of $H' = I' \prod_{j=1}^{m'} a'_j$.
Then $B_c = \left(I' \prod_{j=1}^{m'} a'_j\right) \prod_{i=1}^{m} a_i$ is the backwards extension of $H_c$ by $H'$. It is easy to
see that $H_c$ and $B_c$ have the same value, and the last $m$ actions in $B_c$ have the same
subsequent action sets as the $m$ actions in $H_c$. When a node is created, it has an
initial value, $I_n$. When a copy of a node is created it is given an initial value, which
we call the original value of the copy. This initial value should be chosen in some
meaningful way, and will typically be equivalent to the history of the creating copy,
or to a synthesis of the histories of the existing copies. In either case, the new copy
will have a backwards extension which corresponds to the history of update actions
performed on the copy. If a copy of a node is deleted, then we no longer need to worry
about the node contents. We denote the set of all initial update actions performed on
node $n$ by $M_n$.

We recall that an action on a copy is valid if the action on the current value of
the copy has its associated subsequent action set. A history is valid if action $a_i$ is valid
on $I_c \prod_{j=1}^{i-1} a_j$ for every $i = 1, \ldots, m$. The final value of a history is the final value of
the last action in the history. Two histories are compatible if they are valid, have the
same final values, and have the same uniform updates. If $H_1$ and $H_2$ are compatible,
then we write $H_1 \equiv H_2$.
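A toy version of these definitions may help fix the ideas. The sketch below assumes that a history is a pair (initial keys, list of (action name, kind, key) triples) with kind `"i"` or `"r"`, that the only update action is an insert, and that every insert is valid, so validity is not checked. All names are illustrative.

```python
from collections import Counter

def final_value(history):
    """Replay the update actions on the initial value."""
    initial, actions = history
    value = set(initial)
    for name, kind, key in actions:
        if name == "insert":
            value.add(key)
    return value

def uniform(history):
    """U(H): drop the initial/relayed marker from every action."""
    _, actions = history
    return [(name, key) for name, kind, key in actions]

def compatible(h1, h2):
    """Same final value and the same uniform updates (order may differ).
    Validity is assumed, since inserts are always valid in this toy model."""
    return (final_value(h1) == final_value(h2)
            and Counter(uniform(h1)) == Counter(uniform(h2)))

h1 = ({1}, [("insert", "i", 5), ("insert", "r", 9)])
h2 = ({1}, [("insert", "r", 9), ("insert", "r", 5)])
h3 = ({1}, [("insert", "i", 5)])

same = compatible(h1, h2)      # same updates, different order and markers
missing = compatible(h1, h3)   # h3 never received the insert of 9
```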

Our correctness criteria for the replica maintenance algorithms are the following:

Compatible History Requirement: A node $n$ with initial value $I_n$ and update
action set $M_n$ has compatible histories if, at the end of the computation $C$,

1. every copy $c \in \mathit{copies}(n)$ with history $H_c$ has a backwards extension $B_c$ such
that the update actions in $H' = B_c \setminus H_c$ contain exactly the actions in

2. every backwards extension $B_c$ can be rearranged to form $H'_c$ such that $U(H'_{c_1}) = U(H'_{c_2})$
for every $c_1, c_2 \in \mathit{copies}(n)$, and every $H'_c$ is valid.

If an algorithm guarantees that every node has a compatible history, then it meets
the compatible history requirement.

Complete History Requirement: If every subsequent action issued appears
in some node’s update action set, then the computation meets the complete history
requirement. If every computation that an algorithm produces satisfies the complete
history requirement, then the algorithm satisfies the complete history requirement.

Ordered History Requirement: We define an ordered action as one that belongs
to a class $r$ such that all actions of class $r$ are time-ordered with each other
(we assume a total order exists). A history $H$ is an ordered history if for any ordered
actions $h_1, h_2 \in H$ of class $r$, if $h_1 <_r h_2$ then $h_1 < h_2$ in $H$. An algorithm meets the
ordered history requirement if every node has a compatible history that is an ordered
history.
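The ordered-history condition amounts to a simple check on the sequence of actions at a copy. The sketch below is an illustration only; it assumes a single class of ordered actions whose agreed total order is given as a list, and the action names are invented.

```python
def is_ordered_history(history, class_order):
    """history: actions in the order a copy performed them.
    class_order: the agreed total order on the ordered actions of one class.
    Actions outside the class are unconstrained."""
    position = {action: i for i, action in enumerate(class_order)}
    seen = [position[a] for a in history if a in position]
    return seen == sorted(seen)

class_order = ["link_v1", "link_v2", "link_v3"]

ok = is_ordered_history(["ins_a", "link_v1", "ins_b", "link_v3"], class_order)
bad = is_ordered_history(["link_v2", "ins_a", "link_v1"], class_order)
```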

The compatible history requirement guarantees that every node is single-copy
equivalent when the computation terminates. We note that our condition for rearranging
uniform histories is a condition of the subsequent action sets rather than
a condition of the intermediate values of the nodes. The copies need only to have
the same value at the end of the computation, but the subsequent actions can't be
posthumously issued or withdrawn without a special protocol.

The complete history requirement tells us that we must route every issued action
to a copy. A deleted node is conceptually retained in the search structure to satisfy
the complete history requirement. The ordered history requirement lets us remove
explicit synchronization constraints on the equivalent parallel algorithm by shifting
the constraints to the copy coherence algorithm.


An update action must be performed on all copies of a node. With no further
information about the action, it must be performed via an AAS to ensure that the
conflicting actions are ordered in the same way at all copies. However, some
actions commute with almost all other actions, removing the need for an AAS.
In Figure 3.2, the final value of the node is the same at either copy, and the search
structure is always in a good state. Therefore, there is no need to agree on the order
of execution. We provide a rough taxonomy of the degree of synchronization different
updates require.

Lazy Update: We say that a search structure update is a lazy update if it commutes
with all other lazy updates, so synchronization is not required.

Semi-synchronous Update: Other updates are almost lazy updates, but they conflict
with some other actions. For example, the actions may belong to a class of
ordered actions. We call these semi-synchronous updates. A semi-synchronous
action requires special treatment, but does not require the activation of an AAS.

Synchronous Update: A synchronous update requires an AAS for correct execution.
We note that the AAS might block only a subclass of other actions, or might
extend to the copies of several different nodes.


In this section, we describe algorithms for the lazy maintenance of several different
dB-tree algorithms. We work from a simple fixed-copies distributed B-tree to a more
complex variable-copies B-tree, and develop the tools and techniques we need along
the way. For all of the algorithms we develop, we assume that only search and insert
operations are performed on the dB-tree. In addition, we assume the network is
reliable, delivering every message exactly once in order.


For this algorithm, we assume every node has a fixed set of copies. This assumption
lets us concentrate on specifying lazy updates. Every node contains pointers to
its children, its parent, and its siblings. When a node is created, its set of copies are
also created, and copies of the node are never destroyed.

A search operation issues a search action for the root. The search action is a
straightforward translation of the action that a shared-memory B-link tree algorithm
takes at a node. An insert operation searches for the correct leaf using search actions,
then performs an insert action on the leaf. If the leaf becomes too full, the operation
restructures the tree by issuing half-split and insert actions. The insert action
adds a new key at the leaves and adds a pointer to a child in the non-leaf nodes. The
half-split action creates a new sibling (and the sibling's copies), transfers keys from
the half-split node to the sibling, modifies the node to point to the sibling, and sends
an insert action to the parent.
The first step in designing a distributed algorithm is to specify the commutativity
relationships between the actions:

1. Any two insert actions on a copy commute. As in Sagiv's algorithm [49], we
need to take care to perform out-of-order inserts properly.

2. Half-split operations do not commute. Since a half-split action modifies the
right-sibling pointer, the final value of a copy depends on the order in which
the half-splits are processed.

3. Relayed half-split actions commute with relayed inserts, but not with
initial inserts. Suppose that in history $H_p$, initial insert action $I(A)$ is performed
before a half-split action $s$ that removes $A$’s range from $p$. Then, if the order
of $I$ and $s$ are switched, $I$ becomes an invalid action. A relayed insert action
has no subsequent actions, and the final value of the node is the same in either
ordering. Therefore, relayed half-splits and relayed inserts commute.

4. Initial half-split actions don't commute with relayed insert actions. One of the
subsequent actions of an initial half-split action is to create the new sibling.
The key which is inserted either will or won’t appear in the sibling, depending
on whether it occurs before or after the half-split.
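Rules 2 and 3 can be checked on a toy model. The sketch below assumes a copy is a tuple (low key, high key, set of keys, right-sibling pointer). The half-split here simply overwrites the upper bound and the right pointer, which is a simplification chosen for illustration, and the sibling names `S1` and `S2` are made up.

```python
def relayed_insert(copy, key):
    """Relayed insert: applied only if the key is still in range; no subsequent actions."""
    lo, hi, keys, right = copy
    if lo <= key < hi:
        keys = keys | {key}
    return (lo, hi, keys, right)

def relayed_half_split(copy, split_key, sibling):
    """Relayed half-split: shrink the range, drop moved keys, point at the new sibling."""
    lo, hi, keys, right = copy
    kept = frozenset(k for k in keys if k < split_key)
    return (lo, split_key, kept, sibling)

start = (0, 100, frozenset({10, 60}), None)

# Rule 3: a relayed insert of 70 and a relayed half-split at 50, in both orders.
a = relayed_half_split(relayed_insert(start, 70), 50, "S1")
b = relayed_insert(relayed_half_split(start, 50, "S1"), 70)
insert_split_commute = (a == b)

# Rule 2: two half-splits in both orders; the right-sibling pointer depends on the order.
c = relayed_half_split(relayed_half_split(start, 50, "S1"), 25, "S2")
d = relayed_half_split(relayed_half_split(start, 25, "S2"), 50, "S1")
splits_commute = (c == d)
```

The relayed insert and relayed half-split give the same final value in either order, while the two half-splits leave different right-sibling pointers depending on which is applied last.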


By our classification methods, an insert is a lazy update and a half-split is a
semi-synchronous update. If the ordering between half-splits and inserts isn’t maintained,
the result is lost updates (see Figure 3.3). We next present two algorithms
to manage fixed-copy nodes. To order the half-splits, both algorithms use a primary
copy (PC), which executes all initial half-split actions (non-PC copies never execute
initial half-split actions, only relayed half-splits). The algorithms differ in how the
insert and half-split actions are ordered. The synchronous algorithm uses the order
of half-splits and inserts at the primary copy as the standard to which all other
copies must adhere. The semi-synchronous algorithm requires that the ordering at
the primary copy be consistent with the ordering at all other nodes (see Figure 3.4).

Figure 3.3. An example of the lost-insert problem

We do not require that all initial insert actions are performed at the PC, so copies
might find that they exceed their maximum capacity. However, since each copy is
maintained serially, it is a simple matter to add overflow blocks.

Algorithm: An operation is executed by submitting an action, and each action
generates subsequent actions until the operation is completed. An operation is executed
by executing its B-link tree actions, as discussed previously. Thus, all we need
to do is specify the execution of an action at a copy. The synchronous split algorithm
uses an AAS to ensure that splits and inserts are ordered the same way at the PC
and at the non-PC copies (see Figure 3.4).

Half-split: Only the PC executes initial half-split actions. Non-PC copies execute
relayed half-split actions. When the PC detects that it must half-split the
node, it does the following:

1. Performs a split_start AAS locally. This AAS blocks all initial insert actions,
but not relayed insert or search actions.

2. The PC sends a split_start AAS to all of the other copies.

3. The PC waits for acknowledgments of the AAS from all of the copies.

4. When the PC receives all of the acknowledgments, it performs the half-split,
creating all copies of the new sibling and sending them the sibling’s
initial value.

5. The PC sends a split_end AAS to all copies, and performs a split_end AAS
on itself.

When a non-PC copy receives a split_start AAS, it blocks the execution of initial
inserts and sends an acknowledgment to the PC. The executions of further
initial insert actions on the copy are blocked until the PC sends a split_end
AAS. When the copy processes the split_end AAS, it modifies the range of the
copy and the right-sibling pointer, discards pointers no longer in the node's
range, and unblocks the initial insert actions.

Insert: When a copy receives an initial insert action it does the following:

1. Checks to see if the insert is in the copy's range. If not, the insert action
is sent to the right sibling.

2. If the insert is in range, and the copy is performing a split AAS, the insert
is blocked; otherwise,

3. The insert is performed and relayed insert actions are sent to all of the
other copies.

When a copy receives a relayed insert action, it checks to see if the insert is in
the copy's range. If so, the copy performs the insert. Otherwise, the action is
discarded.

Search: When a copy receives a search action, it examines the node’s current state
and  issues  the  appropriate  subsequent  action. 

We note that since non-PC copies can't initiate a half-split action, they may be
required to perform an insert on a too-full node. Actions on a copy are performed
on a single processor, so it is not a problem to attach a temporary overflow bucket.
The PC will soon detect the overflow condition and issue a half-split, correcting the
problem.
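The behaviour of a non-PC copy under the synchronous split algorithm can be sketched as a small state machine. This is a simplified illustration, not the implementation: messages are only recorded in an `outbox` list rather than sent, keys stand in for node contents, and the class and method names are hypothetical.

```python
class Copy:
    def __init__(self, lo, hi, right=None):
        self.lo, self.hi, self.right = lo, hi, right
        self.keys = set()
        self.splitting = False   # True while a split AAS is active
        self.blocked = []        # initial inserts held during the AAS
        self.outbox = []         # (message, payload) pairs this copy would send

    def split_start(self):
        self.splitting = True
        self.outbox.append(("ack", None))

    def split_end(self, split_key, sibling):
        # Shrink the range, drop moved keys, then release the blocked inserts.
        self.hi, self.right = split_key, sibling
        self.keys = {k for k in self.keys if k < split_key}
        self.splitting = False
        held, self.blocked = self.blocked, []
        for key in held:
            self.initial_insert(key)

    def initial_insert(self, key):
        if not (self.lo <= key < self.hi):
            self.outbox.append(("forward_to_right", key))
        elif self.splitting:
            self.blocked.append(key)
        else:
            self.keys.add(key)
            self.outbox.append(("relay_insert", key))

    def relayed_insert(self, key):
        if self.lo <= key < self.hi:   # out-of-range relayed inserts are discarded
            self.keys.add(key)

replica = Copy(0, 100)
replica.initial_insert(10)
replica.split_start()
replica.initial_insert(20)   # blocked by the AAS
replica.initial_insert(80)   # blocked, then forwarded once the range shrinks
replica.relayed_insert(30)   # relayed inserts are never blocked
replica.split_end(50, "sibling")
```

After the split ends, the blocked insert of 20 is performed and relayed, while the insert of 80 is forwarded to the right sibling because it no longer falls in the copy's range.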

Theorem 1 The synchronous split algorithm satisfies the complete, compatible, and
ordered history requirements.

Proof: We observe that the fourth link-algorithm guideline is satisfied, so that whenever
an action arrives at a copy, its parameter is within the copy's inreach. Therefore,
the synchronous split algorithm satisfies the complete history requirement.

Since there are no ordered actions, the synchronous split algorithm vacuously
satisfies the ordered history requirement.

We show that the synchronous algorithm produces compatible histories by showing
that the histories at each node are compatible with the uniform history at the
primary copy. First, consider the ordering of the half-split actions (a half-split is performed
at a node when the split_end AAS is executed). All initial half-split actions
are performed at the PC, then are relayed to the other copies. Since we assume that
messages are received in the order sent, all half-splits are processed in the same order
at every copy.

Consider an initial insert $I$ and a relayed half-split $s$ performed at non-PC copy
$c$. If $I < s$ in $H_c$, then $I$ must have been performed at $c$ before the AAS_start for
$s$ arrived at $c$ (because the AAS_start blocks initial inserts). Therefore, $I$’s relayed
insert $i$ must have been sent to the PC before the acknowledgment of $s$ was sent. By
message ordering, $i$ is received at the PC before $S$ is performed at the PC, so $i < S$
in $H_{PC}$. If $s < I$ in $H_c$, then $S < i$ in $H_{PC}$ because $S < s$ and $I < i$ (due to message
passing causality). □

We note that this algorithm makes good use of lazy updates. For example, only

messages  instead  of  the  0(|copies(n)|)  messages  this  algorithm  uses.  Furthermore, 
search  actions  are  never  blocked. 

Semi-synchronous Splits

blocks insert actions and requires only |copies(n)| messages per split (and therefore
is optimal).

Algorithm: The semi-synchronous split algorithm is the same as the synchronous
split algorithm, with the following exceptions:

1. When the PC detects that a split needs to occur, it performs the initial split
(creates the copies of the new sibling, etc.), then sends relayed split actions to
the other copies.

2. When a non-PC copy receives a relayed split action, it performs the relayed
split.

3. If the PC receives a relayed insert and the insert is not in the range of the PC,
the PC creates an initial insert action and sends it to the right neighbor.
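The third exception is what prevents lost inserts, and it can be sketched briefly. The code below is an illustration under simplifying assumptions: the PC is a dictionary, the network is a list of pending messages, and the function name `pc_on_relayed_insert` and the sibling identifier are hypothetical.

```python
def pc_on_relayed_insert(pc, key, network):
    """Handle a relayed insert arriving at the primary copy."""
    if pc["lo"] <= key < pc["hi"]:
        pc["keys"].add(key)
    else:
        # The relayed insert raced with a split at the PC: re-issue it as an
        # initial insert addressed to the right neighbor instead of dropping it.
        network.append(("initial_insert", key, pc["right"]))

pc = {"lo": 0, "hi": 100, "keys": {10, 60}, "right": None}
network = []

# The PC performs the initial split at key 50; the new sibling takes the upper half.
sibling_keys = {k for k in pc["keys"] if k >= 50}
pc["keys"] = {k for k in pc["keys"] if k < 50}
pc["hi"], pc["right"] = 50, "sibling"

# A non-PC copy performed an initial insert of 70 before it saw the relayed split,
# so its relayed insert reaches the PC after the split.
pc_on_relayed_insert(pc, 70, network)

# Deliver the re-issued insert to the sibling.
for action, key, target in network:
    if action == "initial_insert" and target == "sibling":
        sibling_keys.add(key)
```

The key 70 ends up in the sibling even though the insert and the split were never synchronized.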

Theorem 2 The semi-synchronous split algorithm satisfies the complete, compatible,
and ordered history requirements.

Proof: The semi-synchronous algorithm can be shown to produce complete and ordered
histories in the same manner as in the proof of Theorem 1.

We need to show that all copies of a node have compatible histories. Since relayed
inserts and relayed splits commute, we need only consider the cases when at least
one of the actions is an initial action. Suppose that copy $c$ performs initial insert $I$
after relayed split $s$. Then, by message causality, the PC has already performed $S$,
so the PC will perform $i$ after $S$.

Suppose that $c$ performs $I$ before $s$ and the PC performs $i$ after $S$. If $i$ is in the
range of the PC after $S$, then $i$ can be moved before $S$ in $H_{PC}$ without modifying any
other actions. If $i$ is no longer in the range of the PC after $S$, then moving $i$ before $S$ in
$H_{PC}$ requires that $S$’s subsequent action set be modified to include sending $i$ to the new
sibling. This is exactly the action the algorithm takes. □


take advantage of the semantics of the insert and split actions.

In the next section, we observe a different type of lazy copy management which also
simplifies implementation and improves performance.

3.5.2 Single-Copy Mobile Nodes

In this section, we briefly examine the problem of lazy node mobility. We assume
that there is only a single copy of each node, but that the nodes of the B-tree can
migrate from processor to processor (typically, to perform load-balancing). When
a node migrates, the host processor can broadcast its new location to every other
processor that manages the node (as is done in Emerald [29]). However, this algorithm
requires large amounts of wasted effort and doesn't solve the garbage collection
problems.

The algorithms we propose inform the node's immediate neighbors of the new
address. In order to find the neighbors, a node contains links to both its left and
right sibling, as well as to its parent and its children. When a node migrates to a
different processor, it leaves behind a forwarding address. If a message arrives for
a node that has migrated, the message is routed by the forwarding address. We
are left with the problem of garbage-collecting the forwarding addresses (when is it
safe to reclaim the space used by a forwarding address?). As with the fixed-copies
scenario, we propose an eager and a lazy algorithm to satisfy the protocol. We have
implemented the lazy protocol, and found it effectively supports data balancing [30].

The eager algorithm ensures that a forwarding address exists until the processor
is guaranteed that no message will arrive for it. Unfortunately, obtaining such a
guarantee is complex and requires much message passing and synchronization. We
omit the details of the eager algorithm to save space.

Suppose that a node migrates and doesn't leave behind a forwarding address. If
a message arrives for the migrated node, then the message clearly has misnavigated.
This situation is similar to the misnavigated operations in the concurrent B-link
protocol, which suggests that we can use a similar mechanism to recover from the
error. We need to find a pointer to follow. If the processor stores a tree node, then
that node contains the first link on the path to the correct destination. So the error-recovery
mechanism is to find a node that is ‘close’ to the destination and follow that
set of links.

The other issue to address is the ordering of the actions on the nodes (since there
is only one copy, every node history is vacuously compatible). The possible actions
are the following: insert, split, migrate, and link-change. The link-change actions are
a new development in that they are issued from an external source, and need to be
performed in the order issued.

Algorithm: Every node has two additional identifiers, a version number and a
level. The version number allows us to lazily produce ordered histories. The level,
which indicates the distance to a leaf, aids in recovery from misnavigation. An
operation is executed by executing its B-link tree actions, so we only need to specify
the execution of the actions.

Out-of-range: When a message arrives at a node, the processor first checks if the
node is in range. This check includes testing to see if the node level and the
message destination level match. If the message is out of range or on the wrong
level, the node routes it in the appropriate direction.

Migration: When a node migrates,

1. All actions on the node are blocked until the migration terminates.

2. A duplicate copy of the node is made on a remote processor (with the
exception that the version number increases by 1).

3. A link-change action is sent to all known neighbors of the node.

4. The original node is deleted.

Insert: Inserts are performed locally.

Half-split: Half-splits are performed locally by placing the sibling on the same processor
and assigning the sibling a version number one greater than the half-split
node's. An insert action is sent to the parent, and a link-change action is sent
to the right neighbor.

Link-change: When a node receives a link-change action, it updates the indicated link
only if the update's version number is greater than the link's current version
number.

Missing Node: If a message arrives for a node at a processor, but the processor doesn’t
store the node, the processor performs the out-of-range action at a locally stored
node. If the processor doesn’t store a search structure node, the action is sent
to the root.
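The role of the version number in the migration and link-change rules can be sketched as follows. The sketch is a minimal illustration: blocking during migration, deletion of the original node, and message delivery are left out, and the `Node` class, the `migrate` helper, and the processor names are hypothetical.

```python
class Node:
    def __init__(self, name, processor, version=0):
        self.name = name
        self.processor = processor
        self.version = version
        self.links = {}  # direction -> (processor, version) of the neighbor

    def link_change(self, direction, processor, version):
        """Apply the update only if it is newer than the link's current version."""
        current = self.links.get(direction)
        if current is not None and version <= current[1]:
            return False          # stale message: ignore it
        self.links[direction] = (processor, version)
        return True

def migrate(node, new_processor):
    """Copy the node to new_processor with version + 1; return the copy and the
    link-change message that its neighbors should receive."""
    moved = Node(node.name, new_processor, node.version + 1)
    moved.links = dict(node.links)
    return moved, (new_processor, moved.version)

left = Node("L", "P1")
right = Node("R", "P2")
left.links["right"] = (right.processor, right.version)

right, msg1 = migrate(right, "P3")   # version 1
right, msg2 = migrate(right, "P4")   # version 2

# The two link-change messages reach the left neighbor out of order.
applied_new = left.link_change("right", *msg2)
applied_old = left.link_change("right", *msg1)
```

Because the stale message carries a smaller version number, it is ignored, and the left neighbor's link ends up pointing at the node's latest location without any synchronization between the two migrations.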


Theorem 3 The lazy algorithm satisfies the complete, compatible, and ordered history
requirements.

Proof: There is only a single copy of a node, so the histories are vacuously compatible.
Each action takes a good state to a good state, so every action eventually finds its
destination. Therefore, the algorithm produces complete histories.

The only ordered actions are the link-change actions. The node at the end of a
link can only change due to a split or a migration. In both cases, the node’s version
number is incremented. When a link-change action arrives at the correct destination,
it is performed only if the version number of the new node is larger than the version
number of the current node. If the update is not performed, the node’s history is
rewritten to insert the link change into its proper place. Let $l$ be a link-change action
that is not performed, and let $l$ be an ordered action of class $C$. Let $a_i$ be the ordered
action of class $C$ in $H_c$ that is ordered immediately after $l$ (there is no $a_j$ such that
$l <_C a_j <_C a_i$). We rewrite $H_c$ to be $H'_c = I_c \left(\prod_{j=1}^{i-1} a_j\right) l \left(\prod_{j=i}^{m} a_j\right)$. Thus, the
history can be rewritten so that it remains valid. □

We note that an implementation of the lazy single-copy algorithm can use forwarding
addresses to improve efficiency and reduce overhead. The forwarding addresses
are not required for correctness, so they can be garbage-collected at convenient
intervals.


In this scenario, we assume that leaf level nodes can migrate, and that processors
can join and leave the replication of the index nodes (so we can use this algorithm to
implement a never-merge dB-tree). We assume that the leaf nodes are not replicated,
and that the PC of a node never changes.

The lazy algorithm that we propose combines elements of the lazy fixed-copy
and migrating-node algorithms by using lazy splits, version numbers, and message
recovery.

To allow for data balancing, we let the leaf level nodes migrate. The leaf level
nodes aren't replicated, so we can manage them with the lazy algorithm for migrating
nodes (Section 3.5.2). We want to maintain the dB-tree property that if a processor
owns a leaf node, it has a copy of every node on the path from the root to the leaf. If
a processor obtains a new leaf node, it must join the set of copies for every node from the
root to the leaf which it does not already help maintain. If the processor sends off the
last child of a node, it unjoins the set of processors which maintain the parent (applied
recursively). When a processor joins or unjoins a node replication, the neighboring
nodes are informed of the new cooperating processor with a link-change action. To
facilitate link-change actions, we require that a node have pointers to both its left
and right sibling. Therefore, a split action generates a link-change subsequent action
for the right sibling, as well as an insert action for the parent.

We assume that every node has a PC that never changes (we can relax this
assumption). The primary copy is responsible for performing all initial split actions
and for registering all join and unjoin actions. The join and unjoin actions are analogous
to the migrate actions. Hence, every join or unjoin registration increments the version
number of the node. The version number permits the correct execution of ordered
actions, and also helps ensure that copies which join a replication obtain a complete
history (see Figure 3.5). When a processor unjoins a replication, it will ignore all
relayed actions on that node and perform error recovery on all initial action requests.


Out-of-range: If a copy receives an initial action that is out-of-range, the copy sends
the action across the appropriate link. Relayed actions that are out of range
are discarded.

Insert: 1. When a copy receives an initial insert action, it performs the insert and
sends relayed-insert actions to the other node copies that it is aware of.
The copy attaches its version number to the update.

2. When a non-PC copy receives a relayed insert, it performs the insert if it
is in range, and discards it otherwise.

3. When the PC receives a relayed insert action, it tests to see if the relayed
insert action is in range.

(a) If the insert is in range, the PC performs the insert. The PC then
relays the insert action to all copies that joined the replication at a
later version than the version attached to the relayed update.

(b) If the insert is not in range, the PC sends an initial insert action to
the appropriate neighbor.

Split: 1. When the PC detects that its copy is too full, it performs a half-split
action by creating a new sibling on several processors, designating one of
them to be the PC, and transferring half of its keys to the copies of the
new sibling. The PC sets the starting version number of the new sibling
to be its own version number plus one. Finally, the PC sends an insert
action to the parent, a link-change action to the PC of its old right sibling,
and relayed-split actions to the other copies.

2. When a non-PC copy receives a relayed half-split action, it performs the
half-split locally.

Join: When a processor joins a replication of a node, it sends a join action to the PC
of the node. The PC increments the version number of the node and sends a
copy to the requester. The PC then informs every processor in the replication
of the new member and performs a link-change action on all of its neighbors.

Unjoin: When a processor unjoins a replication of a node, it sends an unjoin action to
the PC and deletes its copy. The processor discards relayed actions on the node
and performs error recovery on the initial actions. When the PC receives the
unjoin action, it removes the processor from the list of copies, relays the unjoin
to the other copies, and performs a link-change action on all of its neighbors.
Relayed join/unjoin: When a non-PC copy receives a join or an unjoin action, it
updates its list of the node's copies.
Link-change: A link-change action is executed using the migrating-node algorithm.

Missing-node: When a processor receives an initial action for a node it doesn't manage,
it submits the action to a 'close' node, or returns the action to the sender.
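The way the PC uses version numbers to catch inserts that race with a join can be illustrated as follows. This is only a sketch under stated assumptions: the class `PrimaryCopy`, the half-open key range, and the tuple return values are hypothetical and serve only to make the rule in step 3 of the insert action concrete.

```python
class PrimaryCopy:
    """Hypothetical model of the primary copy (PC) of one node."""

    def __init__(self, low, high):
        self.low, self.high = low, high  # assumed key range [low, high)
        self.version = 1
        self.keys = set()
        self.joined_at = {}              # processor -> version at which it joined

    def join(self, processor):
        # Every join registration increments the node's version number.
        self.version += 1
        self.joined_at[processor] = self.version
        return set(self.keys), self.version  # copy handed to the requester

    def relayed_insert(self, key, attached_version):
        if not (self.low <= key < self.high):
            # Out of range: becomes an initial insert at a neighbor.
            return ("forward-initial-insert", key)
        self.keys.add(key)
        # Relay only to copies that joined after the sender's version,
        # since the sender could not have known about them.
        late = sorted(p for p, v in self.joined_at.items()
                      if v > attached_version)
        return ("relay-to", late)
```

A copy that attaches an old version number to its update has not yet heard of later joiners, so the PC forwards the update to exactly those processors.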
Theorem 4 The variable-copies algorithm satisfies the complete, compatible, and ordered
history requirements.
Proof: We can show that the variable-copies algorithm produces complete and ordered
histories by using the proof of Theorem 3. If we can show that for every node $n$,
the history of every copy $c \in copies(n)$ has a backwards extension $H'$ whose uniform
update actions are exactly $M_n$, then the proof of Theorem 2 shows that the variable-copies
algorithm produces compatible histories.

For a node $n$ with primary copy PC, let $A_i$ be the set of update actions performed
on PC when the PC has version number $i$. When copy $c$ is created, the PC updates its
version number to $j$ and gives $c$ an initial value $I_c = I_n B_c$, where $B_c$ is the backwards
extension of $I_c$ to $I_n$ and contains all uniform update actions in $A_1$ through $A_{j-1}$.
The PC next informs all other copies of the new replication member. After a copy $c'$
is informed of $c$, $c'$ will send all of its updates to $c$. The copy $c'$ might perform some
initial updates concurrent with $c$'s joining $copies(n)$. These concurrent updates are
detected by the PC by the version number algorithm and are relayed to $c$. Therefore,
at the end of a computation, every copy $c \in copies(n)$ has every update in $M_n$ in its
uniform history. Thus, the variable-copies algorithm produces compatible histories. □
In this chapter, we have discussed the following:

• Replication Algorithms

• Lazy Updates on a dB-tree

• Correctness theory for Lazy Updates

We present algorithms for implementing lazy updates on a dB-tree, a distributed
B-tree. The algorithms can be used to implement a dB-tree which never merges
empty nodes and performs data balancing on the leaves (we have previously found
that the free-at-empty policy provides good space utilization [21] and that leaf-level
data balancing is effective and low-overhead [M]). We provide a correctness theory
for lazy updates, so lazy update techniques can be used to implement lazy updates on
other distributed and replicated search structures [U]. Lazy updates, like lazy
replication, permit the efficient maintenance of the replicated index nodes. Since little
synchronization is required, lazy updates permit concurrent search and modification
of a node, and even concurrent modification of a node. Finally, distributed search
structures which use lazy updates are easier to implement than more restrictive
algorithms because lazy updates avoid the use of synchronization. The next chapter
presents the details of our implementation of the distributed B-tree.


CHAPTER  4 
IMPLEMENTATION 


A distributed environment consists of processors capable of communicating with
each other through messages. We implemented the distributed B-tree on a general
network architecture, a LAN comprised of SPARC workstations. Every processor
is capable of communicating with other processors and has a sufficient amount
of local storage. Each processor acts as a server responding to messages from other
processors.


The B-tree is distributed by partitioning the nodes of the tree across a network of
processors. The processors communicate by sockets (a Unix internetwork
message passing scheme). To provide a user interface, we integrated X Windows in
our design. In this design, there is an overall B-tree manager, called the anchor,
which oversees all the B-tree operations. The anchor is responsible for creating
new processes on different processors when necessary. Every processor is individually
responsible for the nodes it maintains.

On each processor we have a queue manager and a node manager. The queue manager
handles inter-processor communication and queues the incoming messages. The
node manager takes messages from the queue and performs the operations (specified
in the message) on the various nodes at that processor. This division of process
functionality into the queue manager and the node manager enables the node manager
to be independent of the inter-processor communication method. The queue
manager and the node manager at a processor communicate via the inter-process
communication schemes supported by UNIX, namely message queues (Figure 4.1).

Figure 4.1. The Communication Channels


The anchor is responsible for initializing the B-tree. In addition, the anchor
receives update messages from external applications and sends them to the appropriate
processor. Each processor is responsible for the decisions it makes concerning the tree
structure it holds. In the current implementation, the anchor makes the decision
if two or more processors are involved. In order to do so, the anchor must have a
picture of the global state of the system. The B-tree processing will continue while
the anchor makes its decision, so the global picture will usually be somewhat out of
date. Our algorithms take this fact into account.

The anchor begins building the tree by selecting a processor (the root processor)
to hold the root of the tree. The node manager at the root processor has a socket
connection to the anchor. Update operations are passed to the root processor and
percolate down to the leaf level, where the decisive action of the operation is
performed. The dashed lines in Figure 4.1 represent temporary communication channels
established between two processors for the transfer of nodes, which will be described
later.

4.2.2 Node Structure

Logically adjacent nodes may not reside at the same processor; hence, a
parent/child/sibling pointer may refer to a node at some other processor. Also, nodes
cannot be uniquely identified by the local address. Every node in the B-tree has
a name associated with it that is not dependent on the location of the node. This
mechanism for naming nodes is known as location-independent naming of nodes. A
typical node would have a parent pointer, the children pointers, and the sibling
pointers. In addition to having the highest value in itself, the node must also keep the high
value of its logical neighbors. This will enable a node to determine if the operation
is meant for itself or destined for either of its siblings.

• Location Independent Naming of a Node
Whenever a node is created, it is given a name that is unique among all
processors. For instance, the node name may be a combination of the processor
number that creates it and the node identifier within the processor. A hashing
mechanism is used to translate between node names and physical node
addresses. When a node bob moves from processor A to processor B, it retains
its name. The advantage of this mode of naming nodes is that a parent, child
or sibling node that references the node bob need not know the exact address
of bob in processor B.

A further advantage of location independent naming arises when the nodes are
replicated. All copies of a node on different processors have the same name.
So, the primary and secondary copies of a node can keep track of each other.

Our implementation is primarily concerned with the update operations: inserts
and deletes.

An insert operation at a leaf processor inserts the key in the appropriate node,
say n. If the insertion of a key causes the node n to become too full, the node
splits by creating a new sibling s and moves half the keys from the original
node n to the new sibling s. The parent node p of node n is informed of
this split by sending a message to the processor the parent resides on. The
message contains the name of the new sibling s and the modified high and low
values of n and s. To improve parallelism and reduce the number of messages
in the system, the child processor does not wait for an acknowledgement from
the parent processor; instead, we use the B-link tree protocol.

When the parent node p receives a split message from the child, it adds the
node s as a new child, and adjusts the high and low values of its children n
and s. If the addition of the child s causes the parent node p to become full,
it splits into p and np. The key transfer takes place as at the lower level, and
a split message from the parent p travels to its parent gp; the process may
recurse upward to the root. The children of p are not informed immediately of
the split, so those transferred to np do not immediately update their parent
pointer to np. Our design can tolerate the "lazy" update of these pointers since
a message from the child s to the old parent p will find the correct new parent.

If an insert causes the parent to split, the message percolates up towards the
root node. In the event that the root node splits, a new root has to be created,
and the height of the tree is increased.
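The leaf-level split described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation: nodes are plain dictionaries, the sibling's name is formed by appending a prime to the original name, a node's high value is taken to be an exclusive upper bound, and the split message carries only the sibling name and the two high values.

```python
def insert_key(node, key, capacity, outbox):
    """Insert `key` into leaf `node`. On overflow, half-split the node and
    queue a split message for the parent without waiting for a reply."""
    node["keys"].append(key)
    node["keys"].sort()
    if len(node["keys"]) <= capacity:
        return None                      # no split needed
    mid = len(node["keys"]) // 2
    sibling = {
        "name": node["name"] + "'",      # hypothetical naming scheme
        "keys": node["keys"][mid:],      # upper half moves to the sibling
        "high": node["high"],
        "right": node["right"],
        "parent": node["parent"],
    }
    node["keys"] = node["keys"][:mid]
    node["high"] = sibling["keys"][0]    # assumed: high is an exclusive bound
    node["right"] = sibling["name"]      # B-link pointer to the new sibling
    # Fire-and-forget: the child does not block on an acknowledgement.
    outbox.append(("split", node["parent"], sibling["name"],
                   node["high"], sibling["high"]))
    return sibling
```

Until the parent processes the queued message, a request that reaches the original node for a key now held by the sibling can still be routed through the right link, which is what makes the missing acknowledgement tolerable.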

The delete operations pose more complications, as deletion of keys means
shifting the responsibility of a key range between two nodes. A delete operation
removes the key from a leaf level node, e.g., node n. The restructuring actions
depend on whether the tree is a free-at-empty or merge-at-half B-tree. Merging
across processors involves too much overhead in terms of synchronization and
messages, and thus is not cost efficient. So, if the neighbors are on the same
processor, then the merge-at-half protocol is used; otherwise, the node is allowed
to become empty (i.e., the free-at-empty protocol is used).


The problem that occurs when nodes can split as well as merge is that some
actions can be performed twice at some copies, leading to inconsistency. This
transpires when an action occurs at the PC before the split and at a non-PC
copy after the merge.

When the key ranges of interior nodes change due to merging, care must
be taken to synchronize the inserts or deletes with the splits and merges. Let
us consider this scenario: Suppose there are three copies of a node, c1, c2, and
PC (Figure 4.2). Let the initial insert of key k, I(k), be performed at c1. The
relayed insert i(k) is relayed to c2 and the PC. Before the relayed insert i(k)
reaches the PC, the node n has split into n' and s. The relayed insert i(k) at
the PC is forwarded to the sibling s as I'(k) and is performed there. This is now
relayed as i'(k) to the copies c1 and c2. The copy c2 performs i'(k) on s. Now
suppose D(k) is performed on s at PC. Subsequently, relayed deletes d(k) are
performed on s at copies c1 and c2. Let the nodes n' and s merge now to form
n'' and s'' (where n'' contains the range of k). Now, the relayed insert i(k) (from
copy c1) arrives at c2 and k is inserted in n'', losing the action D(k). The copies
of n'' at c1 and the PC do not have the key k, but the copy at c2 contains
the key k. Thus, the key k is inserted twice and never deleted from c2. (If the
action i(k) had arrived at c2 before the merge, then the node n' for which it was
intended would not contain the range and hence the action would be discarded,
leaving all copies consistent.)


Figure 4.2. Duplicate actions due to merges

1. Free-at-empty:

A node n that becomes empty does not get deleted until its neighbors
update their links. A processor that receives a sibling-empty message
blocks deletes and sends an acknowledgement after it has set the link.
After the acknowledgements are received from both neighbors, the space
is freed. The node pointer must also be deleted from the parent. A message
is sent to the parent node and n is marked as deleted. However, the node
remains in the doubly-linked list with its siblings until an acknowledgement
arrives from the parent. This ensures that no further updates to the node
n will be received; so n is removed from the list and its space is reclaimed.
In the interval before the acknowledgement is received, any operations to
the deleted node n are sent to its siblings (as appropriate). If a node is
asked to delete a pointer that does not exist (as the relayed insert has
not yet arrived at that copy) but is in its key range, the delete action is
delayed until the corresponding insert action arrives. Thus, a node has to
remember delayed deletes.

2. Merge-at-half:

In addition to deleting nodes that are empty, we have incorporated a merge
protocol to implement merge-at-half. If the removal of a key reduces n to
less than half its maximum capacity, the node shares its keys either to the
right or to the left. The idea here is to keep the nodes equally full. If the
right or left neighbor has more than half the keys, the excess is shared
with the node n.
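The delayed-delete rule from the free-at-empty protocol above can be sketched as follows. The class `Copy` and its method names are hypothetical, and the key-range check is omitted for brevity; the sketch only shows how a copy remembers a delete that overtook its matching relayed insert.

```python
class Copy:
    """Hypothetical model of one copy of a node that remembers delayed deletes."""

    def __init__(self):
        self.keys = set()
        self.delayed_deletes = set()

    def relayed_insert(self, key):
        if key in self.delayed_deletes:
            # The matching delete arrived first: applying the insert and then
            # the waiting delete leaves the key absent, so both are dropped.
            self.delayed_deletes.discard(key)
        else:
            self.keys.add(key)

    def relayed_delete(self, key):
        if key in self.keys:
            self.keys.discard(key)
        else:
            # The insert has not arrived yet: remember the delete.
            self.delayed_deletes.add(key)
```

Whichever order the two relayed actions arrive in, the copy ends without the key, which matches the state at the copies that saw the insert first.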

The transfer of keys between two adjacent leaves must be recorded at the
parent. The parent is made aware of the key range in its child subtree so that
future updates are directed properly.

pointer to the child. On receiving a change-in-key-range message from a child,

one of the above situations, so the algorithm is applied recursively (Figure 4.3). If a delete

one of them is deleted and it is left with one child. A message is sent to the anchor
to shrink the tree. The anchor makes the only child of the root the new root of the
shorter tree. It also removes the old root node and deallocates the processor holding
the old root node.

To obtain a better understanding of how these protocols work, let us look at the
algorithms in Figures 4.3 through 4.6. When a processor receives a delete message,
the message travels to the appropriate leaf node and then the procedure in Figure 4.3 is
invoked. In this algorithm, the key v is deleted from the node n that resides on
processor p. The contents of node n have changed, so the state of node n is decided
by invoking the algorithm decide_state (Figure 4.4). Procedure decide_state may
return any of the following values:

• INITVAL: In this case the root node has been reached and so the delete process
is completed. A relayed_delete message is sent to all copies of the node.

• EMPTY_LOCAL: The parent of node n resides on the same processor as
node n, so the parent is informed of the key deletion in node n, and the process
continues recursively upwards.

• EMPTY_REMOTE: The parent of node n does not reside on the same processor,
so a message is sent to the processor holding the parent node, indicating
that the node n has become empty and that the child pointer to n should be removed.

• MERGE_RIGHT: The right neighbor of node n resides on the same processor,
p, so the node n and its right neighbor share the keys among themselves (Figure 4.5).

• MERGE_LEFT: The left neighbor of the node n resides on the same processor,
p, so the node n and its left neighbor share the keys among themselves (Figure 4.6).

• NO_MERGE: If the node n is neither empty nor less than half-full, then a merge
cannot be done. So the siblings of the node are updated and the parent of the
node n is informed of the new high and low values of the node n, if the parent
resides on the same processor. Otherwise, a message is sent to the parent on
some other processor to update node n's values. A relayed_delete message is
sent to all copies of the node.
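The return values listed above suggest a decision procedure along the following lines. This is a hedged sketch and not the procedure of Figure 4.4: the parameter list, the order of the tests, the preference for the right neighbor, and the choice to return NO_MERGE when an under-full node has no neighbor on its own processor are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    INITVAL = 0
    EMPTY_LOCAL = 1
    EMPTY_REMOTE = 2
    MERGE_RIGHT = 3
    MERGE_LEFT = 4
    NO_MERGE = 5


def decide_state(n_keys, capacity, is_root, parent_local,
                 right_local, left_local):
    """Classify node n after a delete (all parameters are hypothetical)."""
    if is_root:
        return State.INITVAL             # root reached: delete is complete
    if n_keys == 0:
        # Empty node: free-at-empty, parent is either local or remote.
        return State.EMPTY_LOCAL if parent_local else State.EMPTY_REMOTE
    if n_keys < capacity / 2:
        # Merge-at-half is only used with a neighbor on the same processor.
        if right_local:
            return State.MERGE_RIGHT
        if left_local:
            return State.MERGE_LEFT
    return State.NO_MERGE
```

For example, `decide_state(3, 8, False, True, True, False)` yields `State.MERGE_RIGHT`, while the same node with both neighbors on remote processors yields `State.NO_MERGE` and is simply allowed to keep shrinking.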
Figure 4.3. Recursive-Delete Algorithm

Figure 4.5. Procedure Perform_merge_right for Deletes

Figure 4.6. Procedure Perform_merge_left for Deletes


We have addressed the need for distributing the B-tree. Distributing the tree
arbitrarily implies that some processors may have many nodes (due to splits). Hence,
when there are plenty of nodes in the system, the processors run the risk of running out of
storage capacity. Hence, for efficient use of storage and other resources, it is necessary
to balance the load among processors. We will be discussing the various algorithms
for data balancing in the next chapter. However, certain inherent issues, such as
methods for dealing with out-of-order messages caused by delays introduced by the
underlying network, and low-overhead synchronization of tree restructuring, will be
discussed here. Methods for node mobility essential for data balancing and resource
sharing are also discussed. We have developed algorithms for dynamic data-load
balancing that use the mechanisms of node mobility. In this chapter, we present
some issues pertaining to this balancing. Other issues that arise from load balancing
are mechanisms for node mobility and out-of-order information handling.

The fundamental issue in load balancing is the actual process of moving a node
between processors. This is termed the node migration mechanism and is common
to all of our algorithms for load balancing.

Another important concern is the out-of-date information that the processors
have. Since processors do not have up-to-date information about every other
processor in the system, they must rely on the old information to make decisions. When
a processor decides to transfer nodes, it
selects a receiving processor (how this is done will be explained in the next chapter)
and follows a negotiation protocol to determine the exact number of nodes to
transfer. This will be discussed in greater detail in Section 4.4.

The node migration algorithm should address the following questions:

1. Who is involved? Should the sender and all other processors be locked up
until all pointers to the node in transit get updated? If this is so, parallelism
would be lost. How should we achieve maximum parallelism?

2. When is everyone informed? Once a node has been selected for migration,
how and when is every other processor informed of its new address? In the
interim in which node movement takes place and the other related processors
are informed, what happens to the updates that come for the node in transit?

3. How is obsolete information handled? When a node moves it sends an
update link message to related processors. Suppose the update message for link
change gets delayed and the node moves for a second time. The second update
message may reach a processor before the first one. What approach
should one take to resolve this problem?


In the context of node mobility, object mobility has been proposed in Emerald [26].
Emerald is an object-based language which places emphasis on the mobility of objects.
Objects in Emerald can be data objects or process objects and the distribution is
adaptive to dynamically changing loads. Here, every object has a forwarding address
comprised of a timestamp and address. Every time the object moves, the address
and the timestamp are updated. If an object moves from node A to node B, only
node A and node B are updated. When node C addresses the object at node A, the
message is forwarded to node B. Finally, node B responds to the message and sends
the message back to C with its new address piggybacked. Objects keep forwarding
information even after they have moved to another node and use a broadcast protocol
if no forwarding information is available.


4.3.1 Node Migration Algorithm


For the following discussion on the node migration mechanism, let us assume the

recipient processor that is willing to accept nodes. The actual method by which this
is done will be explained in a later section.

After  the  node  manager  is  informed  of  a recipient  for  its  excess  nodes,  it  must 

Who is involved? Our solution to the first problem is aimed at maintaining
parallelism. Because a node keeps its
name when moved between processors, there is no need for acknowledgments in our
algorithm. After the node selection is done, the sending processor (henceforth called
the sender) establishes a communication channel with the receiver and a negotiation
protocol follows. In the negotiation protocol, the sender and the receiver come to an
agreement as to how many nodes are to be transferred. After a decision has been
reached, the sender sends a node, updates the forwarding information in the node, and
transfers the next node. A node that has been sent is tagged as in transit and no
operations are performed on that node at the sender (Alg. 2).

When is everyone informed? The sender and receiver update all locally stored
pointers to the transferred nodes. If the related nodes are on different processors
(other than the sender and receiver), the receiver sends link update messages to
them. If in the meantime some messages arrive for this node at the sender (since all
processors are not yet aware of the migration), the messages are forwarded to the
new address. At specific intervals of time the nodes marked in transit are deleted
and their storage reclaimed.

Since we do not require acknowledgments for the link changes, it is possible that
a message will arrive for the deleted node. In this case, the node manager forwards
the message to a local node that is "close" to the intended address. The message
then follows the B-link tree search protocol to reach its destination. In our current
implementation, we are guaranteed that the processor stores either a parent of the
deleted node, or another node on the same level as the deleted node. The significance
of this deleted node recovery protocol is that we can lazily inform neighboring nodes
of a moved node's new address. This protocol is rarely invoked, since most messages
for the transferred node are handled by the forwarding address.
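The three cases a sender faces when a message arrives for a node (still stored, in transit with a forwarding address, or already reclaimed) can be sketched as follows. The dictionary layout and the function names `deliver` and `reclaim` are hypothetical, and the choice of the "close" fallback node is reduced to a single precomputed entry.

```python
def deliver(processor, node_name):
    """Decide what a processor does with a message addressed to `node_name`.

    `processor` is a hypothetical dict with three entries:
      'nodes'      - names of nodes stored locally,
      'forwarding' - node name -> new processor, for nodes marked in transit,
      'fallback'   - name of a local node that is 'close' to moved nodes.
    """
    if node_name in processor["nodes"]:
        return ("perform", node_name)
    if node_name in processor["forwarding"]:
        # Common case after a migration: use the forwarding address.
        return ("forward", processor["forwarding"][node_name])
    # Deleted-node recovery: restart the search from a nearby local node.
    return ("reroute-via", processor["fallback"])


def reclaim(processor):
    # Run at intervals: drop the in-transit records and free their storage.
    processor["forwarding"].clear()
```

Only messages that arrive after `reclaim` has run take the recovery path, which is consistent with the observation that the recovery protocol is rarely invoked.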

How is obsolete information handled? The final problem is how to deal with
out-of-order messages arriving at a processor. In general, one cannot guarantee
that messages are delivered in the same order as sent. The inherent delays in the
underlying network cause messages to arrive out of order and sometimes even to be
lost. Messages from a single source to a single destination arrive in the same order
sent. However, there is no order imposed on messages from multiple sources to a
destination. The question then is how the system tolerates delayed and even
lost messages. The above problem can be translated to our B-tree as shown by the
following example (Figure 4.7).

Suppose node a moves from processor A to processor B. Consider node p, which
resides on processor P and contains a link to node a. When node a moves to processor
B, an update message is sent to node p at processor P. Before this message reaches P,
node a moves again, to processor C, and a second update message is sent
to P. Suppose the message from C reaches P before the message from B. If node p
at P updates the node address to that of C and then to B, then node p at P has the
wrong address for node a.

A node gets version number 1 when it is created for the first time (unless it is the
result of a split). Every time a node moves, its version number is incremented, and
when a node splits, the sibling gets a version number one greater than that of the
original node. Every pointer has a version number attached, and each link-update
message contains the version of the sending node. When node r receives a link-update
message from s, r updates the link only if s's version number is equal to or greater
than the link version number. In the above example, the version number of node a on
processor A is initially 1. On moving to processor B, the version number changes to 2.
The update message to P from B contains the version number 2. The next update message,
sent to P from C, has the version number 3. Since this last message reaches P
first, node p at processor P notes that its version number for node a is 1. Since
3 >= 1, node p updates the link to point to C and records version number 3. Later, the
message from B that contains version number 2 arrives. But now node p has version
number 3 for node a; since 2 < 3, the message is ignored. So delayed messages that
arrive out of order at a processor are ignored.
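
The version check can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not the implementation used in this work; the names `Link` and `apply_link_update` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A pointer to a remote node, tagged with the version last seen."""
    address: str   # processor that currently holds the target node
    version: int   # version number attached to this pointer


def apply_link_update(link: Link, new_address: str, sender_version: int) -> bool:
    """Apply a link-update message only if it is not older than the link.

    Returns True if the link was updated, False if the message was ignored.
    """
    if sender_version >= link.version:
        link.address = new_address
        link.version = sender_version
        return True
    return False


# Node p's link to node a: a starts on processor A with version 1.
link_to_a = Link(address="A", version=1)

# The update from C (version 3) overtakes the update from B (version 2).
applied_c = apply_link_update(link_to_a, "C", 3)
applied_b = apply_link_update(link_to_a, "B", 2)
```

With these two calls the link ends up pointing at C with version 3, and the late message from B is dropped.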

Our numbering also handles out-of-order link changes due to split actions. Reliable
communications guarantee that messages generated at a processor for the same
destination arrive in the order generated, and when nodes move to different
processors, the version numbers ensure that link changes are processed in the order
generated.


4.4 Negotiation Protocol


In all our algorithms, no extra messages are sent to inform the other processors
of a change in the current status of a processor. Thus, if the number of nodes at
a processor increases or decreases due to splits or merges, other processors are not
aware of it. Neither is the anchor process informed about these changes, the reason
being to avoid excess network traffic. However, not informing others leads to stale
information: the anchor and the processors have outdated information
about other processors. In the load-balancing algorithm, when the anchor has
to decide with whom an overloaded processor must share its data, it finds another
processor based on the outdated information. We will show in the next chapter that
our load balancing algorithms perform very well in spite of old information, because
of the negotiation protocol.


We have designed an atomic handshake protocol for the negotiation. During the
interval in which a receiving processor is chosen, either by the anchor (in centralized
load balancing) or by the sender itself (in distributed load balancing with probing),
the status of both processors may have changed. So, after the receiving processor is
selected, the sender and the receiver enter into negotiation, wherein they update each
other's status and decide exactly how many nodes to share. The negotiation involves
only these two processors, and hence other processors are not hindered. Once
negotiation is completed, node transfer takes place. It should be noted that no
messages are sent to other processors informing them of the negotiation or of the
change in the status of the sender and the receiver.
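
The following Python sketch shows one way the two processors could settle on a transfer size once they have exchanged fresh status. The text does not state the exact rule, so the rule used here (the sender sheds its excess above a limit, and the receiver accepts no more than its free room) is an assumption, as are the names `Status`, `negotiate` and `transfer`.

```python
from dataclasses import dataclass


@dataclass
class Status:
    """Current load of one processor, exchanged during the handshake."""
    node_count: int   # nodes currently stored
    limit: int        # number of nodes the processor is willing to hold (assumed)


def negotiate(sender: Status, receiver: Status) -> int:
    """Return how many nodes the sender should transfer to the receiver.

    Assumed rule (not from the text): the sender tries to get back down to
    its limit, and the receiver accepts no more than would fill it to its
    own limit.
    """
    excess = max(0, sender.node_count - sender.limit)
    room = max(0, receiver.limit - receiver.node_count)
    return min(excess, room)


def transfer(sender: Status, receiver: Status) -> int:
    """Run the negotiation on fresh status values, then move the nodes."""
    n = negotiate(sender, receiver)
    sender.node_count -= n
    receiver.node_count += n
    return n


# Stale information suggested the receiver was nearly empty; it now holds 700.
s = Status(node_count=900, limit=750)
r = Status(node_count=700, limit=750)
moved = transfer(s, r)
```

Because the decision uses the status exchanged during the handshake rather than the stale values held elsewhere, the receiver is never pushed past its own limit.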


4.5 Portability

Finally, we have ported our implementation to the KSR, a shared memory
multiprocessor machine with 96 processors that supports message passing by providing
BSD sockets ([31]). The porting of our implementation shows that our system is
portable and easily scalable to a large number of processors.

4.6 Conclusion

To  conclude,  this  chapter  has  addressed  the  following: 

• Data balancing the dB-tree and the fundamental protocols necessary

• Portability of the implementation

In this chapter we have discussed the implementation of the distributed B-tree on
a network of Sparc stations and the processes needed to manage the dB-tree. Update
operations, insert, search and delete, are performed on the B-tree. We have presented
how these operations are performed, what complications the delete operations
present, and how we overcome them. To facilitate data balancing on the distributed
B-tree, we have introduced the convention of naming nodes so that a node retains
its name when it moves between processors. We will see that this node naming is also
useful when replicating nodes at various processors, since all copies of a node have
the same name.

We have presented two mechanisms fundamental to data balancing, namely, the

the negotiation protocol to overcome the effect of outdated information. Methods by
which our algorithms and protocols tolerate out-of-order messages introduced because
of network delays are also presented.

Finally, to study the portability and scalability of our implementation, we ported
it to the KSR, a large scale shared memory multiprocessor system. In the next chapter
we discuss the algorithms for replication and load balancing and present performance
results.

CHAPTER  5 
PERFORMANCE 


5.1 Introduction

In this chapter we present the various algorithms for replication and data
balancing and discuss their performance in detail. Experiments using the two strategies
for replication, namely full replication and path replication, were conducted. Results
show that path replication will create a scalable distributed B-tree. We validated the
tree scalability by emulating a large scale distributed B-tree and performing
large-scale experiments on it. Several load balancing algorithms have been developed and
their performance measured. The observations show that all our load-balancing
algorithms incur very little overhead while achieving a good data balance. We also
discuss the performance of several load balancing algorithms on the dE-tree. Three
algorithms, namely the random, merge and aggressive merge algorithms, have been
developed for data balancing on the dE-tree, and of these we find that the aggressive
algorithm makes the dE-tree scalable. Timing measurements have been conducted on our
implementation of the dB-tree to study the response times and throughput of our system.
We present those results in this chapter. Using the data from the simulation
experiments, we present an analytical performance model of the dB-tree and the dE-tree.
We find that both algorithms are scalable to large numbers of processors.

In this section, we describe two algorithms for maintaining consistency among
the copies of nodes. Based on the theoretical framework presented in Chapter 3, we
have incorporated two replication strategies in our implementation. Our
implementation of the Fixed-Position copies algorithm is termed Full Replication and
that of Variable copies is Path Replication. We will briefly discuss the algorithms
and the implementation issues in Sections 5.2.1 and 5.2.2.

When the nodes of the B-tree are replicated, an obvious concern is the consistency
and coherency of the various replicated copies of a node. Subsection 5.2.3 will present
the mechanism by which our implementation maintains coherent replicas.


The Fixed-position copies algorithm ([26]) assumes every node has a fixed set of
copies. An insert operation searches for a leaf node and performs the insert action.
If the leaf becomes full, a half-split takes place. In this algorithm, the Primary Copy
(PC) performs all initial half-splits and sends a relayed split to the other copies. Any
excess inserts at a non-PC copy are kept in overflow buckets and adjusted after the
relayed split.

In our implementation, the B-tree is distributed by having the leaf level nodes
at different processors. Leaf level nodes are not replicated and only these nodes
are allowed to migrate between processors. Whenever a leaf node migrates to a
new processor (one that currently stores no leaves), the index levels of the tree are
replicated at that processor. Consistency among the replicated nodes is maintained
by the primary copy of a node sending changes to all its copies.

Once the entire tree has been replicated, only consistency changes need to be

• Algorithm: The decision to replicate the tree is made after a processor (sender)


the leaves are transferred, the sender checks to see if the receiver has received
leaf nodes for the first time. If so, the receiver obviously does not have the
index levels, so the tree has to be replicated at the receiver. The sender then
transfers the tree (index levels) it currently holds. Henceforth, only consistency


5.2.2 Path Replication Algorithm

In the Variable-copies algorithm ([26]), different nodes have different numbers of
copies. A processor that holds a leaf node also holds a path from the root to that
leaf node. Hence, index level nodes are replicated to different extents. A processor
that acquires a new leaf node may also get new copies of index level nodes, and such
a processor then joins the set of node copies for the index level nodes. Similarly, a
processor will 'unjoin' a node when it has no copies of the node's children.

In our path replication algorithm, whenever a leaf node migrates to a different
processor, the entire path from the root to that leaf is replicated at this processor.
However, if the processor holds a leaf and a new sibling migrates to that processor,
only the parent nodes not already resident at this processor are replicated. All link
changes are again handled by the primary copy of a node. When a new copy of a node is
created, the processor sends a 'join' message to all the copies of the node. In the
interim between the node copy being created at the processor and the 'join' message
reaching a processor, any messages about this node copy are forwarded by the primary
copy of the node to this new copy. A processor that sends away all the leaf nodes of a
parent will no longer be eligible to hold the path from the root to that leaf node. In
this case, the processor has to do an 'unjoin' for all its nodes on the path from the
root to the leaf.
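
The join and unjoin decisions can be sketched as a set computation. This Python fragment is a simplified illustration that assumes each node's parent is known locally; the names `required_index_nodes` and `on_leaves_changed`, and the small example tree, are hypothetical and not part of the implementation described here.

```python
from typing import Dict, Optional, Set, Tuple


def required_index_nodes(leaves: Set[str],
                         parent: Dict[str, Optional[str]]) -> Set[str]:
    """Index nodes a processor must hold: every ancestor of each local leaf."""
    needed: Set[str] = set()
    for leaf in leaves:
        node = parent[leaf]
        # Stop early once we reach an ancestor already collected, since all
        # of its own ancestors were collected on an earlier walk.
        while node is not None and node not in needed:
            needed.add(node)
            node = parent[node]
    return needed


def on_leaves_changed(held_index: Set[str], leaves: Set[str],
                      parent: Dict[str, Optional[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (nodes to join, nodes to unjoin) after the local leaf set changes."""
    needed = required_index_nodes(leaves, parent)
    return needed - held_index, held_index - needed


# Example tree: root -> i1 -> {l1, l2}, root -> i2 -> {l3}
parent = {"root": None, "i1": "root", "i2": "root",
          "l1": "i1", "l2": "i1", "l3": "i2"}

# A processor holding l1 receives its sibling l2: nothing new is needed.
join_sibling, _ = on_leaves_changed({"root", "i1"}, {"l1", "l2"}, parent)

# It then receives l3: only i2 is missing from the path.
join_l3, _ = on_leaves_changed({"root", "i1"}, {"l1", "l2", "l3"}, parent)

# It sends away l1 and l2: it must unjoin i1 but keeps root and i2.
_, unjoin = on_leaves_changed({"root", "i1", "i2"}, {"l3"}, parent)
```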


• Algorithm: Our algorithm for path replication is asynchronous, based on a
handshaking protocol. When two processors have interacted in the load
balancing protocol, a decision has to be made concerning the path from the root
to the migrated leaves. Either the sending or receiving processor can request
that the path be sent to the receiver. In our algorithm, the receiver determines
what ancestor nodes are needed after receiving new leaves. It then sends
requests to the processors holding the primary copies of the ancestors to get the
paths. As the receiving processor takes the responsibility of obtaining the path,
the sending processor is free to continue. The receiving processor cannot do
any operations on the new leaves until the path is obtained. Once the path is
obtained, the receiving processor can resume operations (inserts and searches).


5.2.3 Replica Coherency

The operations the current implementation handles are searches and inserts. A
search returns a success or failure and does not cause any further relayed messages to
be issued. An operation on the distributed B-tree can be initiated on any processor.
Since the index levels are fully or partially replicated at all processors, a change in
a node copy at any processor must be communicated to all processors that hold a copy
of that node. Every processor that stores a copy of a node must be aware of all the
inserts on that node. An insert operation in a node could result in a split, so all
processors must be informed about the split. This is done in the following way:

• Insert: An insert operation can be performed on any copy of a node. After
performing the insert, the processor sends a relayed insert to all other processors
that hold a copy of the node. When a processor receives a relayed insert, it
performs the insert operation locally.

• Split: A split operation is first performed at a leaf. If the local parent exists on
the same processor, the local parent is informed of the split. If the split at any
level results in a split at the parent level, then a relayed split is sent to all
processors that hold a copy of the parent node. Otherwise, a relayed insert


Here, we compare the performance of full replication and path replication
strategies for replicating the index nodes of a B-tree.


experiment. 15,000 keys were inserted; statistics were gathered at 5000 key intervals.
The B-tree is distributed over 4 to 12 processors. Each node in the B-tree has a
maximum fanout of S, and average fanout of S. We observed the number of times
a path request has been made by a processor, and the number of times that a load
balancing request had to be reissued (to avoid deadlock), with priority being given to
the path request. We collected statistics as to how many consistency messages are
needed to maintain the distributed, replicated B-tree, how widely the index nodes
are replicated on each processor, and finally how many nodes each processor stores
at the end of the run.

to maintain the replicated B-tree. We see that in the case of full replication, the
number of messages for a 4 processor B-tree is around 9000 and for 12 processors it is
around 35000 (i.e., the message overhead has increased linearly with the number of
processors). However, for a path replicated B-tree, around 3800 messages are needed for
4 processors and only 0300 messages are needed for 12 processors, not even a linear
increase.

Figure 5.1. Full versus Path Replication: Message Overhead

The Space Overhead graph (Figure 5.2) shows the number of nodes stored at all
processors at the end of a run. The graph is similar in nature to the message overhead
graph. In this graph we consider only the index nodes, which account for the excess
storage at each processor as the number of processors increases (the number of leaf
nodes remaining nearly the same for all processors). For full replication, we see that
for a 4 processor B-tree the number of index nodes stored is 1700, whereas for a 12
processor B-tree the number of nodes is 5200, a nearly three-fold increase. In the case
of a path replicated B-tree, the number of index nodes stored over the entire tree for 4
processors is 900 and for 12 processors is 1550, not even a two-fold increase.
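
As a quick check of the growth factors quoted above, the ratios can be computed directly from the reported node counts. The helper name `growth_factor` is introduced here only for illustration.

```python
def growth_factor(small: float, large: float) -> float:
    """Ratio by which a quantity grows between two configurations."""
    return large / small


# Index nodes stored when going from 4 to 12 processors (values from the text).
full_replication = growth_factor(1700, 5200)
path_replication = growth_factor(900, 1550)
processors = growth_factor(4, 12)
```

The processor count grows by a factor of 3, full replication grows by about the same factor, and path replication grows by well under 2.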


The Width of Replication at Level 2 (Figure 5.3) graphs show how widely level 2
index nodes are replicated at each processor for a path replicated B-tree. We selected
level 2 since activity takes place at the leaf level, 1, and mostly affects level 2. The

other chart, number of replicated nodes versus processors, shows that even as we
increase the number of processors, the level 2 index nodes are not widely replicated
at all processors, there being 597 copies for a 4 processor system and only 944
copies for 12 processors.


Path replication causes low restructuring overhead, but can require a search to
visit many processors for its execution. We measured the number of hops required
for the search phase of the insert operation after 5000 inserts were requested in an 8
processor distributed B-tree. Full replication required an average of 0.88 messages per
search, and path replication required 1.29 messages per search (additional overhead
of 0.41 messages).

From the above observations, we see that a path replicated distributed B-tree
performs better than a fully replicated one and is highly scalable (Figure 5.8).

5.3 Data Balancing

We have performed data balancing on the dB-tree and the dE-tree. We will
discuss the algorithms and the performance of the two separately.

5.3.1 The dB-tree

The results obtained from the implementation of a replicated B-tree led us to
explore other algorithms for data balancing on a replicated B-tree. The experiments
with the replication algorithms led us to conclude that a path-replicated B-tree was
more scalable than a fully-replicated B-tree. Hence, we simulated a path-replicated
distributed B-tree. Our objective is to develop data balancing algorithms and also
to observe their performance and the overhead incurred.


In the current design, a limit is placed on the maximum number of nodes of the
tree that a processor can hold, termed the threshold. In addition, each processor has
a soft limit (0.75 * threshold) on the number of nodes. This represents a warning
level indicating a need for distribution of the nodes. Whenever a node splits, the
current number of nodes is checked against the soft limit. If the current number of
nodes exceeds the soft limit, the processor must distribute some of the nodes it has
to some other processor. Our algorithms are characterized by the method by which
the receiver processor is selected.

• Centralized Data Balancing:

One approach is a semi-centralized one, where the anchor is responsible for
choosing the receiver. The overloaded processor asks the anchor for another
processor that can share its excess load.

The anchor has outdated information about all processors' current capacity.
Based on this obsolete information, the anchor selects a receiver processor.

• Distributed Data Balancing:

Here, a processor that needs to share
some of its nodes probes other processors for load information. The probing can
be done in two ways. We assume that every processor has a list of participating
processors.

- Sequential Probing: Here, processors begin probing other processors
sequentially to share the load. A processor will pick the first processor after
itself and check if that processor has sufficient capacity. If not, it probes
the next processor in line, and so on.

- Random Probing: In this approach, the processors randomly probe other
processors. The randomly selected processor is checked for available capacity.
If it does not have enough capacity, then another processor (excluding
the previously rejected ones) is selected randomly.
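
The soft-limit test and the two probing strategies can be sketched as follows. This is a minimal Python illustration, not the code used in the experiments: the function names are hypothetical, `has_capacity` stands in for the actual probe message, and wrapping around the end of the processor list in sequential probing is an assumption.

```python
import random
from typing import Callable, List, Optional

SOFT_LIMIT_FRACTION = 0.75  # soft limit = 0.75 * threshold, as in the text


def over_soft_limit(node_count: int, threshold: int) -> bool:
    """True when a processor must distribute some of its nodes."""
    return node_count > SOFT_LIMIT_FRACTION * threshold


def sequential_probe(me: int, processors: List[int],
                     has_capacity: Callable[[int], bool]) -> Optional[int]:
    """Probe the processors after `me` in list order (wrap-around assumed)."""
    start = processors.index(me)
    for step in range(1, len(processors)):
        candidate = processors[(start + step) % len(processors)]
        if has_capacity(candidate):
            return candidate
    return None


def random_probe(me: int, processors: List[int],
                 has_capacity: Callable[[int], bool],
                 rng: random.Random) -> Optional[int]:
    """Probe processors in random order, never retrying a rejected one."""
    candidates = [p for p in processors if p != me]
    rng.shuffle(candidates)  # a random order visits each candidate at most once
    for candidate in candidates:
        if has_capacity(candidate):
            return candidate
    return None
```

Both functions return `None` when no processor has capacity, which corresponds to the case where additional storage has to be made available.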


After a receiver processor r has been selected, the sender s and the receiver r
interact by a negotiation protocol (Section 4.4). In this protocol, they decide exactly
how many nodes are to be transferred from the sender s to the receiver r. The
negotiation protocol is essential since, in the interval between the selection of the
receiver processor and the actual node transfer, the receiver or sender may experience
more splits and hence a change in their capacities. Also, in the case of the centralized
load balancing protocol, the anchor has out-of-date information about each
processor's status, and the algorithm works well because of the negotiation protocol.


The performance of the dB-tree and the dE-tree depends on how the nodes are
distributed among the processors, which in turn depends on the data balancing
algorithm. In addition, data balancing incurs its own overhead.

There are many non-algorithmic factors that can affect performance. First, the
number of hops that an operation requires to find its data increases with the height of
the tree. Secondly, the width of replication increases with both increasing fanout and
increasing numbers of processors that store the dB-tree. Finally, the manner in which
additional storage is made available to the search structure affects the performance.

• Incremental Growth:

When the storage for the distributed index runs low, the system manager must
add storage capacity to some of the processors, or allow the dB-tree to spread
to more processors. Periodically, we perform incremental storage growth at the
processors that store the dB-tree. This is equivalent to adding a disk to a site
or creating a new storage site. When a processor wishes to share some of its


processor may be started up, or in the event that the processor limit is reached,
a processor is selected randomly and its threshold is increased by a fraction of

new processor with newly added capacity.

• Fixed Height Data Balancing:

To study the effect of large fanout on the width of replication, we fixed the
height of the tree for all of the experiments.

To determine the nature of a large-scale dB-tree, we made a simulation study of
data balancing on a dB-tree. We computed the number of message hops required
to complete an operation, and the width of replication, or average number of copies
of a node. We are mainly concerned with the width of replication of level 2 nodes
(which are most of the index nodes). The width of replication is a measure of the
space overhead of maintaining a distributed index.

Experiments, Results and Discussion

Experiment Description: We create
an initial B-tree with a uniform random distribution of keys. After the initial
B-tree is created we vary the key distribution pattern dynamically. To study the effect
of our load-balancing algorithm when the distribution changes, we have introduced
hot spots in our key generation pattern, where we concentrate the keys in a narrow
range, thereby forcing about 40% of the messages to be processed at one or two 'hot'
processors.

To study the load variation behavior under execution, we collected distributed
snapshots of the processors at intervals of every 10,000 keys inserted in the B-tree.
At each snapshot, we noted the processors' capacity in terms of the number of leaves
it has, the number of index level nodes, and the number of keys. We also noted the
number of times a processor invokes the load balancing algorithm and the number of
nodes it transfers.

Other important statistics are the number of message hops for a search and the width
of replication. We also calculated the average number of times a leaf node moves
between processors (taken with respect to the nodes in the entire B-tree).
To calculate the number of message hops for a search, we simulated 10,000
searches. A key to be searched is generated using a uniformly distributed random
number. Since the path is replicated at each processor, every processor has a copy of
the root of the tree. The search begins at the root of the tree on a randomly chosen
processor. The search proceeds downward towards the leaves on the processor, and
when a child has to be searched that is not on this processor, a new random
processor is chosen from among the processors that hold a copy of the child. This
continues until a leaf node is reached. The message count is incremented each time
a new processor is selected. We also noted at what level in the tree these processor

all levels and over each level.
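
The hop-counting procedure can be sketched in Python. This is a simplified illustration under stated assumptions: the root-to-leaf path for each search key is taken as given rather than found by key comparison, `holders` maps each node to the set of processors storing a copy, and the names `count_hops` and `average_hops` are hypothetical.

```python
import random
from typing import Dict, List, Set


def count_hops(path: List[str], holders: Dict[str, Set[int]],
               rng: random.Random) -> int:
    """Count processor changes while following `path` from root to leaf."""
    # The search starts at the root on a randomly chosen processor.
    current = rng.choice(sorted(holders[path[0]]))
    hops = 0
    for node in path[1:]:
        if current not in holders[node]:
            # The child is not local: pick a random processor holding a copy.
            current = rng.choice(sorted(holders[node]))
            hops += 1
    return hops


def average_hops(paths: List[List[str]], holders: Dict[str, Set[int]],
                 rng: random.Random) -> float:
    """Average number of message hops over a batch of simulated searches."""
    return sum(count_hops(p, holders, rng) for p in paths) / len(paths)
```

For example, if the root and an index node are held only on processor 0 and the leaf only on processor 1, every search over that path costs exactly one hop.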

The width of replication indicates how widely the interior nodes are replicated in
the distributed B-tree. This gives us an idea of how the load-balancing algorithms
work. It also gives us an estimate of the number of message hops needed per search
and of the amount of storage needed to store the B-tree. If the algorithm does a
good job of balancing the nodes at each processor, keeping logically adjacent nodes
close, then the number of copies of interior nodes is much less than with randomly
scattered leaves at every processor, which makes every processor hold almost the
entire index. We calculate the following two measures:

average width of replication = # of copies of interior nodes / # of interior nodes

A finer statistic is the width of replication at each level:

average width of replication at level i = # of copies of level i nodes / # of level i nodes
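
Both measures are simple ratios and can be computed as in the following Python sketch. The data layout (a map from each interior node to the set of processors holding a copy, plus a map from node to level) and the function names are assumptions made for illustration.

```python
from typing import Dict, Set


def width_of_replication(copies: Dict[str, Set[int]]) -> float:
    """Average number of copies per node: total copies / number of nodes."""
    return sum(len(procs) for procs in copies.values()) / len(copies)


def width_by_level(copies: Dict[str, Set[int]],
                   level: Dict[str, int]) -> Dict[int, float]:
    """Width of replication computed separately for each tree level."""
    grouped: Dict[int, Dict[str, Set[int]]] = {}
    for node, procs in copies.items():
        grouped.setdefault(level[node], {})[node] = procs
    return {lvl: width_of_replication(nodes) for lvl, nodes in grouped.items()}


# Interior nodes only: a root at level 3 and three level 2 nodes.
copies = {"root": {0, 1, 2, 3}, "a": {0, 1}, "b": {2, 3}, "c": {3}}
level = {"root": 3, "a": 2, "b": 2, "c": 2}

overall = width_of_replication(copies)   # (4 + 2 + 2 + 1) / 4
per_level = width_by_level(copies, level)
```

In this small example the root is held everywhere, while the level 2 nodes average fewer than two copies each.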

We found that the width of replication is affected significantly by the choice of
which leaf nodes to migrate. We first used random selection, where a processor that
has to distribute its load chooses the leaf nodes randomly. With this we found that
the number of replicated copies was large. So, we improved upon this by sending
out all leaves of a parent node. That is, we selected leaves sequentially. The results
obtained were much better and we present them below.


• Centralised  and  Distributed  Data  Balancing 


The Performance bar charts (Figure 5.4) show the processors' capacity after the
insertion of 100,000 keys. When the "hot-spots" distribution is used with node
fanout 7, processors 2 and 4 are the hot processors, and receive a disproportionate
number of inserts. Without load balancing, the processors vary greatly
in load, with processors 2 and 4 having around 2500 leaf nodes and processor
3 having only around SQO leaves. Our load balancing algorithm distributes the
excess load at processors 2 and 4 among other processors, so that all processors
contain about 1300 leaf nodes when all keys have been inserted. With a node
fanout average of 10, processor 9 stores an excess amount of leaves and the
load balancing algorithm achieves a balance among all processors. The charts
also show the reduction in storage as fanout is increased. With a fanout of 7,
all processors store about 1500 leaves (about 60% of the maximum storage),
whereas with a fanout of 10, all processors store fewer than 1000 leaves (about
40% of the maximum storage).

Table 5.1. Load Balancing Statistics

Table 5.1 shows the calculated average number of probes made by the load
balancing algorithm and the average number of moves made by a leaf node in the
entire system. The centralized load balancer requires a larger number of probes
since the anchor does not know the exact status of all the other processors. It
has to handle stale information about the capacities of the processors. Of
the sequential probing and the random probing mechanisms, random probing
seems to require slightly fewer probes than sequential probing. In
sequential probing, a processor (say 5) may be probed by two processors down
the line (say 1 and 3), but can only serve one. Hence, the second processor (3)
may have to probe another processor before its request is met.


The average number of moves of a leaf node shows that on average a leaf
node moves only 0.5 times in the entire tree, so the load balancing overhead is
not high.


Figure 5.5. Average Number of Hops/Search

The graph shows that even as the number of processors increases, the hops/search does
not increase linearly, the number varying from 1.6 for 10 processors to 2.45
for 50 processors. Even though the number of processors increases fivefold, the
average number of hops increases less than twofold, thus indicating the good
distribution of the nodes among processors.

We also plotted the average number of hops/search versus node fanout and it
is seen that as the fanout increases, the number of hops decreases, as expected.


replication over all levels is 4.9 for 30 processors and with node average fanout
of 7 when leaves are selected randomly, while it is only 9.3 when leaves are
selected sequentially. Similarly at level 2 it is 3.1 for random selection and 1.7
for sequential selection of leaves.

in  maintaining  a good  and  very  close  data  balance  among  processors.  The 
moves  than  our  other  algorithms.  The  distributed  algorithm  with  sequential 

than  the  others.  Also,  the  sublinear  increase  indicates  that  the  algorithms  are 
suitable  for  scaling  to  large  trees  with  large  fanout  and  over  many  processors. 

• Incremental Growth Data Balancing

As explained above, in this algorithm, when none of the processors have
available capacity, instead of increasing the capacity of every processor we select a
processor randomly and increase its capacity. The results obtained show a similar
pattern to that of the general algorithms.

The graphs in Figure 5.8 show that the average number of hops varies between
1.5 and 2.4 as we increase the number of processors from 10 to 50. The width
of replication at level 2 (Figure 5.9) is about 1.7 and the width of replication
over all levels (Figure 5.10) varies from 2 to 2.7.

• Fixed-Height Trees

We performed simulations on fixed height large B-trees by inserting up to 2-5
million keys and varying the average fanout from 10 to 40 (average fanout is
69% of the maximum fanout ([5])). In the first experiment we fixed the tree height
to 4. When the root of the tree had the desired average fanout we collected
statistics. We noted the processors' capacity in terms of the number of leaves
it has, the number of index level nodes, and the number of keys. We also
noted the number of times a processor invokes the load balancing algorithm,
the number of probes required, the number of nodes that it transfers and the
average number of times a leaf node moves between processors (taken with
respect to the nodes in the entire B-tree). The pattern of these statistics has
been studied in the context of small fanout trees [30], so here we concentrate
on the number of message hops for
a search, and the average width of replication at level 2.

Figure 5.8. Incremental Growth Algorithm: Average Number of Hops/Search

Figure 5.9. Incremental Growth Algorithm: Width of Replication at Level 2

- Fixed-height of 4 trees

We first performed experiments with fixed-height trees of height 4. The graphs
in Figures 5.11, 5.12, 5.13, 5.14, 5.15 and 5.16 show the width of
replication at level 2 and the width of replication over all levels, plotted
against an increasing fanout for a fixed number of processors. The graphs
in Figures 5.17 through 5.19 show the variation of the number of hops with
fanout for a fixed number of processors.

Height 4 Tree: Variation of ... of Hops/Search

Figure 5.21. Height 4 Tree: Variation of Width of Replication at Level 2 with Processors

Figure 5.22. Height 4 Tree: Variation of the Width of Replication with Level

Figure 5.23. Height 4 Tree: Linear Regression of the Width of Replication


The width of replication (WOR) at level 2 reaches a plateau around 2.1 for
10 processors (Figure 5.11), around 2.8 for 20 processors (Figure 5.12) and
3.2 for 50 processors (Figure 5.13). Similarly, the width of replication over
all levels shows that for 10 processors the plateau is 2.24 (Figure 5.14), for
30 processors it is 3.1 (Figure 5.15), and for 50 processors it is 3.8
(Figure 5.16). We thus notice that the WOR at level 2 and the WOR over all
levels reach a plateau for a fixed number of processors as the fanout increases.

The number of hops required to perform an operation shows a similar
phenomenon. Figures 5.17, 5.18 and 5.19 plot the number of hops per
operation against increasing fanout for a fixed number of processors. Again,
the number of hops quickly reaches a plateau. From Table 5.2 we see
that the number of hops is nearly constant with increasing fanout, and
reaches a value of 1.99 for 50 processors.

have condensed the data into Table 5.2.


From Table 5.2, we observe that for a dB-tree with a large fanout, the
width of replication and the number of hops per operation depend on the
number of processors only. Therefore we can predict the number of hops
and the width of replication by studying the increase in the plateau value
with an increasing number of processors.


• Number of Hops:

From Table 5.2 and Figure 5.20 we see the effect of increasing
the number of processors on the number of hops. Our results indicate
that the hops do not increase significantly and reach only a value of 1.9.
We conclude that in a large scale dB-tree with 4 levels, an average fanout
of 40 and distributed over 50 processors, at most 2 hops per operation
are required.

• Width of Replication:

In Figure 5.21, we plot the plateau value of the width of replication
at level 2 against the number of processors. The linear regression of
the data shows that the slope for the random probing data is .0248.
For the sequential probing algorithm, the slope is .0295. Based on
the formula: width of replication at level 2 = 1.908 + .0248 * P, we
recalculated the width of replication and in Figure 5.23, we show a

values are nearly those obtained theoretically. So, if we have 1000
processors and a fanout of 1000, then the WOR for level 2 nodes is
about 25 for random probing and 30 for sequential probing.

Another interesting characteristic is the variation of the width of replication
with the level of the tree. In the path replication algorithm for the
dB-tree, the width of replication for the root is the number of processors,
and for the leaves is 1. We plotted the WOR for each level of the tree
(Figure 5.22), keeping the number of processors fixed (50) and fanout fixed at
40. The WOR at level 1 (leaf) is 1, while at level 2 it is 3.2, at level 3 it
is 23.3 and for level 4 it is 50. Thus, we see that in a height 4 tree, the
width of replication at the third level is less than half the number of
processors.
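
The linear fit quoted above for the height 4 tree can be evaluated directly. The following is a minimal sketch, assuming the intercept 1.908 and slope .0248 reported for random probing; the function name and default arguments are illustrative and not part of the original simulator.

```python
def predicted_wor_level2(processors, intercept=1.908, slope=0.0248):
    """Evaluate the linear fit for the width of replication at level 2.

    The default intercept and slope are the values quoted in the text for
    random probing in a height 4 tree; the function name is hypothetical.
    """
    return intercept + slope * processors


if __name__ == "__main__":
    # Print the fitted value for a few processor counts.
    for p in (10, 30, 50, 1000):
        print(p, round(predicted_wor_level2(p), 2))
```

For 50 processors the fit gives a value close to the measured plateau of 3.2, and for 1000 processors it gives a value in the mid twenties, which is in line with the estimate of about 25 stated above.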

- Fixed-height  of  3 trees 

For fixed-height 3 trees, from the charts 5.24 through 5.32, we notice
patterns similar to those of height 4 trees. We notice that the WOR reaches
a plateau with increasing fanout for a fixed number of processors, and
the number of hops is nearly constant for a fixed number of processors.
However, the WOR at level 2 is higher, reaching a value of 6.71 for 50
processors. The WOR over all levels is 7.74 for 50 processors. For the
height 4 tree, the WOR at level 2 is 3.2 and for level 3 (one level lower
than the root) it is 23.3, whereas here the WOR at level 2 (one level lower
than the root) is 6.71 and the maximum number of hops was 1.69. Here
too, we took a linear regression of the variation of the width of replication
at level 2 with the number of processors and obtained a formula for the
width of replication at level 2 as 2.795 + .0815 * P. We recalculated the
width of replication at level 2 using this formula and in Figure 5.33 we
show the experimental and theoretical values obtained. Again, we can
conclude that the WOR and the number of hops are greatly affected by
the number of processors over which the B-tree is distributed.


Figure 5.24. Height 3 Tree: Width of Replication at Level 2 for 10 Processors

Figure 5.29. Height 3 Tree: Width of Replication for 50 Processors

Figure 5.33. Height 3 Tree: Linear Regression of the Width of Replication


- Fixed-height of 5 trees


For fixed-height trees of height 5, we could not get statistics for fanouts
larger than 30, as the tree was very big. The WOR at level 2 is 2.37 for
50 processors and the WOR over all levels is 2.64 for 50 processors. The
maximum number of hops is 2.21. We include the charts 5.34 through
5.42 and Table 5.4 for the sake of completeness.

Figure 5.39. Height 5 Tree: Width of Replication for 50 Processors

Figure 5.40. Height 5 Tree: Average Number of Hops/Search

Table 5.5. Comparison of Fixed-height 3, 4 and 5 Trees with Fanout 20 and


To conclude, we observe that the results obtained favor the scalability of the
distributed B-tree. The number of hops depends on the height of the tree and the WOR
depends on the number of processors, but grows very slowly. From the comparative
table for different height trees shown in Table 5.5, we can see how the width of
replication and average number of hops per operation vary with tree height.


The dE-tree is a practical distributed index constructed from the distributed
B-tree. Its main purpose is to reduce the communication cost by storing


effective approach is to maintain a single leaf (i.e., an extent) that maintains key
range information only, and store the keys in a local data structure.

The difference between the dB-tree and the dE-tree is that in the dE-tree it is the
load balancer that decides whether to split or merge a leaf node. The load balancer
is invoked when a key is inserted into a leaf node. If the load balancer decides that
the processor holds too many keys, it decides to download some of its keys to some
other processor. It selects a leaf node and decides to perform either a merge or a
split. The processor with which to merge, or to which to give away the split sibling,
is also selected based on certain criteria. We will explain the selection criteria for
the leaves and the processors below.
Algorithms 

In each of these algorithms, the load balancer decides if the processor has an
excess number of keys. Let the excess number of keys be k.

Random: As the name suggests, the leaf node to be merged or split is selected
randomly.


in  a processor. 


• Step 1. Pick a random node. If that node is owned by processor P then go to

• Step 2. If the node n has a right neighbor r and r's owner processor has
available capacity, then transfer the excess nodes and stop.

• Step 3. If the left neighbor l of n has available capacity, then transfer the excess
nodes and stop.
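
The neighbor checks in Steps 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: the `Leaf` class, the `free` dictionary of spare capacity per processor, and the reading of "available capacity" as "free space of at least k keys" are all hypothetical and are not taken from the original simulator.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Leaf:
    owner: int   # processor that stores this leaf (hypothetical field)
    keys: int    # number of keys held by the leaf
    left: Optional["Leaf"] = field(default=None, repr=False, compare=False)
    right: Optional["Leaf"] = field(default=None, repr=False, compare=False)


def random_balance(leaves: List[Leaf], processor: int, free: Dict[int, int],
                   excess: int, rng: random.Random) -> Optional[int]:
    """Move `excess` keys from a randomly chosen leaf of `processor`.

    Returns the receiving processor, or None if neither neighbor's owner
    has enough free space (an assumed reading of "available capacity").
    """
    owned = [leaf for leaf in leaves if leaf.owner == processor]
    if not owned:
        return None
    node = rng.choice(owned)                   # Step 1: pick a random owned leaf
    for neighbor in (node.right, node.left):   # Step 2, then Step 3
        if neighbor is None or neighbor.owner == processor:
            continue
        if free.get(neighbor.owner, 0) >= excess:
            moved = min(excess, node.keys)
            node.keys -= moved
            neighbor.keys += moved
            free[neighbor.owner] -= moved
            return neighbor.owner
    return None
```

The right neighbor is tried before the left one, mirroring the order of the steps; what happens when both fail is left to the caller.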

Merge: Here, we select a leaf node such that it can be merged with either its
left or right neighbor. If there is no such leaf node, then the largest extent leaf is

• Step 1. Scan through the list of nodes until an extent is found that is owned
by the processor P and has at least k keys. Let this node be n.

• Step 2. If the node n has a right neighbor r and r's owner processor has
available capacity, then transfer the excess nodes and stop.

• Step 3. If the left neighbor l of n has available capacity, then transfer the excess
nodes and stop.

• Step 4. If you cannot find a processor that can take all the excess keys k, then
scan through the list of nodes until you find another node owned by this processor
P with the largest number of keys. If found, go to step 2, else continue (reaching
a null node indicates that all nodes have been searched).

• Step 5. Try to merge the keys of node s with either its right or left neighbor
(as in step 2 or step 3). If neither of them can take the keys, select a processor,
say R, randomly and increase its capacity. This is equivalent to adding extra
disk space at one particular processor.

• Step 6. See if R is either the right neighbor's owner or the left neighbor's
owner. If so, merge with the right neighbor or left neighbor by transferring the
excess keys and stop.

• Step 8. Split the node s. Give the new sibling to processor R. Stop.


In the above merge algorithm, we search through the set of
nodes owned by the processor for one such that a neighbor can take all keys offered.
In the aggressive merge approach, we first search for a node that the processor owns
such that a neighbor can take all of the keys in the node. Then, if we cannot find
any neighbor that can take all keys, we settle for sending fewer than k keys.
So, we search for a neighbor that can take the largest number of keys less than
k. The strategy works because on the next insert the processor will balance again.

• Step a. Set merge-node = NULL; set maximum = 0.

• Step b. Pick the first node n on the list owned by the processor P that has
excess keys k. Let the right neighbor processor R have free space f. If (f >
maximum) set merge-node = n. Merge with the neighbor by transferring
the minimum of f and k. Stop.

• Step c. Scan through the list and pick the next node s in line. If the end of
the list is reached, go to step d. Let the right neighbor have free space, free. If
free > maximum set merge-node = s and maximum = free. Go to step

• Step d. If maximum is 0, then go to step 6 of the merge algorithm. Else
merge-node gives the node that can be merged with its right neighbor by
giving away maximum keys. Stop.
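
The selection logic of the aggressive merge can be sketched as a single scan. This is a hedged interpretation rather than the original code: the input format (pairs of node identifier and free space at the right neighbor's owner) and the function name are assumptions, and the early exit when a neighbor can absorb all k keys reflects the prose description above rather than the literal wording of the steps.

```python
from typing import Hashable, List, Optional, Tuple


def aggressive_merge_choice(
    owned_nodes: List[Tuple[Hashable, int]], excess: int
) -> Tuple[Optional[Hashable], int]:
    """Choose a node to merge and the number of keys to give away.

    owned_nodes holds (node_id, free_space_of_right_neighbor_owner) pairs,
    a hypothetical representation. Returns (None, 0) when no neighbor has
    any free space, in which case the caller falls back to the capacity
    increase of the merge algorithm.
    """
    merge_node, maximum = None, 0               # Step a
    for node_id, free in owned_nodes:           # Steps b and c: scan the list
        if free >= excess:
            return node_id, excess              # a neighbor can take all k keys
        if free > maximum:
            merge_node, maximum = node_id, free
    if maximum == 0:                            # Step d: nothing can be sent
        return None, 0
    return merge_node, maximum                  # send fewer than k keys
```

Sending fewer than k keys leaves the processor over its limit, which is acceptable here because the load balancer runs again on the next insert.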


The simulation of the dE-tree is similar to that of the dB-tree, except that the
leaves hold key ranges (extents) and can hold an arbitrary number of keys. The
interior nodes have an average fanout, defined as 70% of the maximum fanout. We
performed experiments with average fanouts of 5, 7 and 10. A total of 500,000 keys
were inserted and up to 50 processors were used for distributing the B-tree.


A uniform random distribution of keys is chosen to create
the initial dB-tree. Initially each processor was given one leaf node with a range of

To study the load variation behavior under execution, we collected distributed
snapshots of the processors at intervals of every 50,000 keys inserted in the dE-tree.
At each snapshot, we noted the processors' capacity in terms of the number of leaves

balancing algorithm and the number of nodes that it transfers.

hops for a search, the width of replication and the number of probes required for
load balancing. We also calculated the average number of times a leaf node moves

Results: We first compared the random and the merge algorithms. In this
experiment we built a dE-tree with an average fanout of 10 in the interior nodes, with
500,000 keys, and used from 10 to 50 processors. We observed that the two algorithms
behaved quite similarly for certain statistics. Both algorithms did
a good job of maintaining a data balance, with the mean being around .74 and the
variance being 0.000001. The number of hops per message also varies similarly in
both algorithms, from 1.18 to 2.04 while varying the processors from 10 to 50 in a tree
of height 3. The width of replication varied between 5.8 and 7.13 for the algorithms.


Figure 5.43. dE-tree: Comparison of the Random vs Merge Algorithms (a: versus Number of Keys; b: versus Processors)

The difference in the algorithms is reflected in the number of leaves and the
number of interior nodes that are stored at each processor. We see from graph
5.43a that for a dE-tree distributed over 30 processors, the random algorithm stores

that the merge algorithm does a far superior job at reducing the storage overhead
of the dE-tree. However, the number of merges that occur is about 1000 for the
random algorithm whereas for the merge algorithm the number is IdOO. Also, there
is a restructuring penalty for the merge algorithm, with 70 nodes and 346 copies being
touched (i.e., involved in the restructuring) while only 16 nodes and 71 copies are
touched with the random algorithm.

The results obtained from the above algorithms indicate clearly that the merge
algorithm is more efficient in reducing storage space without affecting the number
of hops per message and the width of replication. So, we decided to explore the
merge algorithm further. One of the studies was to see the effect of the number of
keys in the dE-tree on the number of leaves and interior nodes. So, we performed
experiments with 2.5 million and 5 million keys.

From Table 5.6 we see that for 10 processors, in a dE-tree with 2.5 million
keys, the number of leaves is 508, interior nodes is 54, the number of nodes touched
for restructuring is 115 and copies 623. The corresponding numbers for a dE-tree
with 5 million keys are: leaves 516, interior nodes 56, nodes touched
108 and copies 566. So, we see that increasing the number of keys did not greatly
increase the number of leaves and interior nodes. The same pattern can be seen for
20, 30, 40 and 50 processors.

It can also be seen that both the number of leaves and interior nodes increase
nearly linearly with the number of processors (Figures 5.44, 5.45, 5.46 and 5.47).

Figure 5.44. Effect of Increasing the Number of Processors on the Number of Leaves
stored in a dE-tree with 2.5 million Keys for the Merge Algorithm

Figure 5.45. Effect of Increasing the Number of Processors on the Number of Interior
Nodes stored in a dE-tree with 2.5 million Keys for the Merge Algorithm

Figure 5.46. Effect of Increasing the Number of Processors on the Number of Leaves
stored in a dE-tree with 5 million Keys for the Merge Algorithm

Figure 5.47. Effect of Increasing the Number of Processors on the Number of Interior
Nodes stored in a dE-tree with 5 million Keys for the Merge Algorithm


Another observation is that the number of copies touched for restructuring is more
or less constant as the number of processors increases.

interior nodes, and increasing the size of the dE-tree does not have a great effect on
the number of leaves and interior nodes. So, the next step was to see if we could
reduce the number of leaves even further by varying some input parameters to the
experiment.


Extensions: The main objective of performing further experiments is to observe
the effect of the input parameters on the number of leaves in the dE-tree.

• Questions to be answered:

- What are the input parameters that affect the extents of the dE-tree?

- How does one vary the selected input parameters?

• Experiment: In our original experiment, we inserted 2.5 million keys in total in
a dE-tree with an average fanout of 10. We built our initial dE-tree by assigning
some number of keys to each processor. After the initial dE-tree is built, keys
are inserted for the dE-tree to grow. When a processor decides that it holds too
many keys, it invokes the load balancer, which attempts to distribute the keys
among the active processors. If none of the active processors has available
capacity, then a processor is chosen and its capacity increased by an increment.
We noticed that by selecting the number of keys to build the initial dE-tree
and the increment appropriately, we could reduce the number of extents in the
final dE-tree. Thus, we chose the Initial_keys (keys to build a small initial
dE-tree) and the Increment, which is the storage added to a processor during load


Table 5.7. Comparison of Doubling Initial Keys and Increment for a dE-tree with
2.5 Million Keys


balancing, as the two input parameters to vary. The increment remains the
same for the entire run.

The next concern is how to vary these parameters. We started off by allowing a
growth of 50 times for the dE-tree; hence if the final dE-tree holds 2.5 million
keys, then Initial_keys is chosen as 2.5 million / (50 * number of processors). The
increment is chosen as 2 * Initial_keys.

• Various Scenarios and their Results:

We have the following scenarios:

- Double Initial_keys, Original Increment

In this scenario we allowed a growth of the tree of 25 times, so we inserted
2.5 million / (25 * number of processors) keys initially. The increment was
chosen as 2 * 2.5 million / (50 * number of processors). The number of leaves
stored at the end of the run was 535 and the number of interior nodes was 53.
The number of nodes touched for restructuring was 115 and copies 554
(for 10 processors) (Table 5.7).

- Original Initial_keys, Double Increment

Here, we allowed a growth of 50 times for the tree and hence inserted
2.5 million / (50 * number of processors) keys initially and doubled the
increment to 2 * (2 * 2.5 million / (50 * number of processors)). The number
of leaves stored at the end of the run was 253, with interior nodes being 31.
Nodes touched for restructuring was 299 and copies 409 (Table 5.7).

The numbers obtained above show that the same Initial_keys, double Increment
method reduces the number of leaves nearly by half. Comparing this to the
original algorithm, with Initial_keys = 2.5 million / (50 * number of processors) and
Increment = 2 * 2.5 million / (50 * number of processors), we see that the number of
leaves has reduced from 505 to 253 and interior nodes from 54 to 31.
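
The parameter choices for these scenarios follow mechanically from the formulas above. The sketch below assumes the baseline Initial_keys = total keys / (50 * number of processors) and Increment = 2 * Initial_keys as stated in the text; the function and argument names are illustrative only.

```python
def scenario_parameters(total_keys, processors, key_factor=1.0,
                        increment_factor=1.0, growth=50):
    """Return (initial_keys, increment) for one scenario.

    key_factor and increment_factor scale the baseline values, e.g.
    key_factor=2 for "Double Initial_keys" and increment_factor=2 for
    "Double Increment". All names here are hypothetical.
    """
    base_keys = total_keys / (growth * processors)   # baseline Initial_keys
    base_increment = 2 * base_keys                   # baseline Increment
    return base_keys * key_factor, base_increment * increment_factor


if __name__ == "__main__":
    # Baseline and the two scenarios for 2.5 million keys on 10 processors.
    print(scenario_parameters(2_500_000, 10))
    print(scenario_parameters(2_500_000, 10, key_factor=2))
    print(scenario_parameters(2_500_000, 10, increment_factor=2))
```

With 2.5 million keys and 10 processors the baseline works out to 5000 initial keys per processor and an increment of 10000.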

The results obtained from performing the experiments were interesting enough
to prompt us to explore the effect of the variation of Initial_keys and Increment
further. We noticed that it was the Increment added to a processor
that affected the number of leaves in the dE-tree. Hence, we varied the
Increment, keeping the original Initial_keys.

After performing the experiments with these two scenarios, to investigate
further we came up with three other scenarios. We added the following scenarios
(as listed in Table 5.8):

- Original Initial_keys, Half Increment

- Original Initial_keys, Quarter Increment

Table 5.8. Various Scenarios of the Input Parameters for a dE-tree of 2.5 Million
Keys

Scenario                              Number of Keys   Increment
Original Keys, Double Increment       K                2 * I
Keys Doubled, Original Increment      2 * K            I
Original Keys, Half Increment         K                I/2
Original Keys, Quarter Increment      K                I/4
Original Keys, Tenth of Increment     K                I/10

K = 2.5 million / (50 * number of processors)

Table 5.9. Effect of Changing the Increment on a dE-tree with 2.5 Million Keys


We observed that halving the increment increased the number of leaves to 962,
and reducing the increment to a quarter of its original value brought the number of
leaves to 1992 and interior nodes to 205. Thus, changing the increment from .5 to 1
and then to 2 changed the number of leaves from 962 to 508 to 283 (Table 5.9),
showing that the number of leaves is almost inversely proportional to the increment
added to a processor during load balancing. From this we can conclude that
the size of the tree depends on the number of times the increment is performed.
With a small increment, the number of times the increment is performed is
large and so, also, is the number of leaves.
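
One way to check how the leaf count tracks the increment is to multiply each relative increment by the leaf count reported for it: if the leaf count scales with the reciprocal of the increment, the products stay roughly constant. The sketch below uses only the four values quoted in the paragraph above; the variable names are illustrative.

```python
# (relative increment, number of leaves) pairs quoted in the text
# for a dE-tree with 2.5 million keys.
observations = [(0.25, 1992), (0.5, 962), (1.0, 508), (2.0, 283)]

# If leaves scale like 1 / increment, these products are roughly constant.
products = [increment * leaves for increment, leaves in observations]
spread = max(products) / min(products)

if __name__ == "__main__":
    for (increment, leaves), product in zip(observations, products):
        print(increment, leaves, product)
    print("max/min ratio of products:", round(spread, 3))
```

The products stay within about 20% of each other while the increment itself changes by a factor of eight.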

The above discussion has shown that the merge algorithm and its various
scenarios definitely show an improvement over the random algorithm. The next
step was to see if we could improve the results even further by designing a new
algorithm, so we developed the aggressive merge algorithm. We show a comparison
of the merge and the aggressive merge algorithms in Figure 5.48a for
30 processors. The number of leaves for the merge algorithm is about 2045,
whereas for the aggressive merge the number of leaves is only 339. The plot
shows us that the aggressive merge algorithm is definitely more efficient.


Figure 5.48. dE-tree: Comparison of the Merge vs Aggressive Merge Algorithms (a: versus Number of Keys (millions); b: versus Processors)

We also plot the number of leaves versus processors in Figures 5.43b and 5.48b
and note the quadratic behavior of the curves. So, the aggressive merge algorithm
does a much better job at reducing the storage overhead at each processor, while
increasing the cost of restructuring, as expected.

Figure 5.52. dE-tree: Number of Leaves versus Keys for 40 Processors for Aggressive
Merge Algorithm

Figure 5.54. dE-tree: Number of Leaves versus Processors for Aggressive Merge
Algorithm


In Figures 5.49 through 5.53, we plot the number of leaves versus the number

aggressive merge algorithm. We also plot the number of leaves versus processors for 5
million keys and note the quadratic nature of the curve (Figure 5.54). It can be seen
from the charts (5.49 and 5.50) that the number of leaves is flattening out, reaching
a plateau for the plots of 10 and 20 processors. A good algorithm should have no more
than about n(n-1)/2 leaf nodes (a processor is neighbors with every other one). Our
aggressive merge algorithm achieves this, as the number of leaves flattens out with
increasing numbers of keys for 10 and 20 processors. For 30 or more processors, the
simulation did not execute long enough to reach a plateau value, as the final number
of leaves is less than n(n-1)/2 for n >= 30.
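
The n(n-1)/2 target is easy to tabulate for the processor counts used in the experiments. The sketch below is a small illustration; the function name is hypothetical, and the only measured value used is the 339 leaves reported earlier for the aggressive merge algorithm on 30 processors.

```python
def leaf_bound(n):
    """Approximate upper target of n(n-1)/2 leaves for n processors."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (10, 20, 30, 40, 50):
        print(n, leaf_bound(n))
    # The 339 leaves reported for 30 processors lie below the target.
    print(339 < leaf_bound(30))
```

For 30 processors the target is 435 leaves, so the reported 339 leaves are consistent with the statement that the run had not yet reached its plateau.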

As for the dB-tree algorithms, here too we observed the width of replication at
all levels, the height of the tree and the number of hops per message for a dE-tree
with 5 million keys. We see that the height of the dE-tree is 3 for 10 processors
and 4 for 20 to 50 processors, with the number of hops varying from 1.05 to 1.74 as
we increase the processors from 10 to 50. The width of replication at level 2 varies
from 6.14 to 10.65. We thus see that our algorithm does not significantly
increase the space and message overhead.

All the above observations lead us to conclude that of all the algorithms, the
aggressive merge algorithm performs the best, having far fewer nodes in the dE-tree.


This chapter has thus far concentrated on the performance of replication and
balancing algorithms from a qualitative point of view, by doing large scale
simulations. Here, we are concerned with characteristics such as system response times
and throughput. The timing information gives us an idea of how fast the system
responds to a query and what the throughput of the system is, in terms of the
number of queries it can process per second. To obtain these timings, we go back to the
implementation of our distributed B-tree that we discussed in Chapter 4.

5.4.1  System  Response  Time 

Response  time  means  the  time  taken  for  a single  query  to  be  processed.  The 

from a node manager that the operation has been completed. The total time taken
for the operation to complete is noted. The anchor then sends out the next query.
The average of the time taken for all operations gives the response time for a single
operation. Response time is defined as

response time = total time taken for all operations / number of operations

Generation  rate  is  defined  as 

generation rate = total number of messages generated / time taken to generate them


Figure 5.55. Experimental Model for Measuring System Throughput

Experiment: Each processor has a separate process, the generator, that generates
the messages (Figure 5.55). The generation of the messages is governed by the

generator communicates with the queue manager by a socket connection. It sends
the time-stamped messages to the queue manager at that processor, which queues
the messages in the message queue for the node manager to pick up. The message
travels to the correct leaf and once the operation is completed, the node manager
time-stamps the message and returns it to the anchor. The anchor thus obtains the
time taken for each message. After it receives all the messages, it calculates the
average response time and generation rate.
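
The two quantities computed by the anchor can be sketched as follows. This is a minimal illustration, assuming each message carries a send and a completion time stamp in milliseconds and that the time taken to generate the messages is measured separately; the function names and the sample values are hypothetical and are not measurements from the experiments.

```python
def average_response_time(stamps):
    """Mean of (completed - sent) over all messages, in the stamps' units.

    stamps is a list of (sent, completed) pairs; this layout is assumed.
    """
    return sum(done - sent for sent, done in stamps) / len(stamps)


def generation_rate(message_count, generation_time_s):
    """Messages generated per second over the generation interval."""
    return message_count / generation_time_s


if __name__ == "__main__":
    # Made-up time stamps in milliseconds, for illustration only.
    stamps_ms = [(0, 40), (100, 150), (200, 230)]
    print(average_response_time(stamps_ms))        # milliseconds
    print(generation_rate(len(stamps_ms), 0.25))   # messages per second
```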

We have chosen to observe the response times of 4, 6, and 8 processors. In
our experiments, we obtained different generation rates by varying the sleep interval
between consecutive messages and noted the response times. Our experiments show
that the response time decreases as the generation rate increases. This could be
because of better access to the CPU, fewer page misses and/or cache misses. After
a certain generation rate the response time is expected to increase as the system is
driven to its limit and begins to slow down due to queueing delays, etc. We have
not been able to observe this trend in our experiments, even though we employed
the maximum generation rate possible. The generation rate is limited by the system
clock granularity and thus, from our experiments, we observe that the generation rate
possible (with no sleep intervals) is not large enough to flood the system
with messages. So, for all practical purposes, we can safely assume that we need to
consider the lowest response time as our system response time.

Results: The graphs show the response times and generation rates for 4, 6 and


Figure  5.56.  Response  Times  for  a 4 Processor  System 


Figure  5.57.  Response  Times  for  a 6 Processor  System 

It was observed that for a 4 processor system, it takes about 35 milliseconds for
an operation to complete, for a 6 processor system it takes 40 milliseconds and for
an 8 processor system, the response time is 55 milliseconds.

In order to justify these timings, it is necessary to know the message transit time,
processing time at a processor and queuing time. It is difficult to get an estimate
of the queuing time, but we performed some simple experiments to determine the
message transit time and the processing time.




Figure 5.58. Response Times for an 8 Processor System


Experimental Model: For the processing time at any processor, unit_processing_time,
we use the same model that we used to collect timings. When the node manager
receives a message it time-stamps it with the processing_start_time and after
processing the message, it again time-stamps the message with the processing_end_time.
All the messages are returned to the anchor, so the anchor calculates the average
unit_processing_time.

forth between all the processes and the anchor. Each message is time-stamped with
the time that it was sent and the time it was received at another processor. The
anchor then finally collects all the messages and calculates the average time it takes
for a message to travel between any two processors.
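
The averaging performed by the anchor can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the record layout (a dictionary with `processing_start_time`, `processing_end_time`, `sent_time` and `received_time` fields, in seconds), the function names and the sample values are all assumptions. It also assumes that the processor clocks agree closely enough for send and receive stamps to be compared.

```python
from statistics import mean


def unit_processing_time(messages):
    """Average time between the two processing time-stamps of each message."""
    return mean(m["processing_end_time"] - m["processing_start_time"] for m in messages)


def unit_message_time(messages):
    """Average time between the send stamp and the receive stamp of each message."""
    return mean(m["received_time"] - m["sent_time"] for m in messages)


# Hypothetical time-stamped messages as collected by the anchor (seconds).
sample = [
    {"sent_time": 0.0000, "received_time": 0.0052,
     "processing_start_time": 0.0100, "processing_end_time": 0.0144},
    {"sent_time": 0.0200, "received_time": 0.0256,
     "processing_start_time": 0.0300, "processing_end_time": 0.0340},
    {"sent_time": 0.0400, "received_time": 0.0448,
     "processing_start_time": 0.0500, "processing_end_time": 0.0548},
]

print(unit_processing_time(sample), unit_message_time(sample))
```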

From our experiments we observed that the message transit time between two
processors is approximately 5.2 milliseconds and the unit processing time is around
4.4 milliseconds. In table 5.10, we calculate the processing times and message transit
times as follows:

processing time = (number of hops + 1) * unit_processing_time

message transit time = (number of hops * unit_message_time) + roundtrip time

The roundtrip time to the anchor is added since a message starts at the anchor and
returns to the anchor.

The time difference in table 5.10 can be attributed to delays that include
message collisions, process context switches, disk swaps, etc.
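
The two formulas above can be written as a short calculation. The sketch below is only illustrative: the function names are hypothetical, and the roundtrip time is left as a parameter because its measured value is not restated here. The example call assumes a roundtrip of two message transits, which is an assumption and not a measured figure.

```python
UNIT_PROCESSING_TIME = 0.0044  # seconds, the measured unit processing time
UNIT_MESSAGE_TIME = 0.0052     # seconds, the measured message transit time


def processing_time(hops, unit_processing_time=UNIT_PROCESSING_TIME):
    """(number of hops + 1) * unit processing time."""
    return (hops + 1) * unit_processing_time


def message_transit_time(hops, roundtrip_time, unit_message_time=UNIT_MESSAGE_TIME):
    """(number of hops * unit message time) + roundtrip time to the anchor."""
    return hops * unit_message_time + roundtrip_time


def unaccounted_delay(measured_response, hops, roundtrip_time):
    """Part of a measured response time not explained by processing and transit."""
    return (measured_response
            - processing_time(hops)
            - message_transit_time(hops, roundtrip_time))


# Assumed roundtrip: one transit out from the anchor and one back.
assumed_roundtrip = 2 * UNIT_MESSAGE_TIME
print(processing_time(2), message_transit_time(2, assumed_roundtrip))
```

The remainder returned by `unaccounted_delay` corresponds to the time difference that the text attributes to collisions, context switches and disk swaps.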

5.5  Performance  Model 

In this section, we present a simple analytical model that predicts operation
response times and the maximum throughput of the distributed search structures
described in this paper. The performance depends on the structure of the dB-tree or
dE-tree. For example, both the number of hops per operation and the degree of
replication affect the amount of overhead required to maintain the search structure.
These values are very difficult to calculate, and they depend on the algorithm used
to perform the data balancing. For this reason, we will use the estimates of the
number of hops and the degree of replication developed in Section 5.3.1. The model
described in this section is loosely based on the model presented in [28]. We assume
that operations are generated uniformly at all processors, and the accesses are made

We first define the variables that we use in the analysis:

$L$: Number of levels in the search structure (level 1 is the leaf, level $L$ is the root).

$P$: Number of processors that maintain the search structure.

$H$: Average number of hops required to navigate to a leaf.

$R_i$: Degree of replication at level $i$, $i = 1, \ldots, L$. $R_1 = 1$ and $R_L = P$.

$F$: Maximum node fanout.

$q_i$: Probability that an operation is an insert operation.

$p_{rs}$: Probability that an operation causes restructuring (split or merge).

$t_a$: Time to process an action.

$t_m$: Processing time for sending and receiving a message.

$\lambda_{tot}$: Total arrival rate of operations to the distributed search structure.

$N_a$: Average number of actions generated by an operation.

$N_m$: Average number of messages generated by an operation.

$W$: Waiting time.

$T$: Response time of an operation.

$Th_{max}$: Maximum throughput.

We start by determining the number of actions and messages required to process
an operation, $N_a$ and $N_m$. Since there are $L$ levels, $L$ search actions are required.
Since each operation requires $H$ hops, $H + 1$ messages are required (a slightly
pessimistic estimate). In addition, an operation might cause restructuring. If there are
more inserts than deletes, then $p_{rs} \approx 1/(.68 \cdot F)$ [28]. When a node splits, the
sibling is created, its right and left neighbors must be informed, and all copies of the
parent must be informed about the new sibling. In turn the parent might split, with
probability $p_{rs}$. Therefore,

(5.1)

(5.2)

If $\lambda$ is the rate at which operations are generated at a node that helps to maintain
the distributed search structure, then the total rate at which operations are generated
is

$$\lambda_{tot} = P \lambda \tag{5.3}$$

A processor that helps to maintain the distributed search structure will be
required to process jobs that correspond to actions and jobs that correspond to message
passing. The average time to process a job is:

$$t_{avg} = \frac{N_a t_a + N_m t_m}{N_a + N_m} \tag{5.4}$$
Since the root is fully replicated, it is not a bottleneck. If the data balancing
distributes the nodes properly, then no leaf node is a bottleneck either. Therefore,
the work to execute an operation is evenly spread among the processors in the system.
As a result, the processor utilization due to search structure processing is

$$\rho = \lambda (N_a t_a + N_m t_m) \tag{5.5}$$

The time that a job spends waiting for processor service can now be calculated
by applying a queuing model. We use a simple M/M/1 queue, and find that

$$W = \frac{\rho \, t_{avg}}{1 - \rho} \tag{5.6}$$

The time to get a response from an operation is the time to process all messages
and actions associated with the operation. With $t_t$ denoting the transmission time
of a message,

$$T = L (W + t_a) + (H + 1)(W + t_t + t_m) \tag{5.7}$$

The maximum throughput is the maximum rate at which every processor can
execute the jobs associated with the search structure operations.

$$Th_{max} = \frac{P}{N_a t_a + N_m t_m} \tag{5.8}$$
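
The model can be evaluated with a few lines of code. The sketch below is a hedged illustration, not part of the original study. It takes $N_a$ and $N_m$ as inputs rather than deriving them. It assumes the standard M/M/1 queueing delay $W = \rho\, t_{avg} / (1 - \rho)$, and it uses `t_t` as an assumed name for the message transmission time in the response-time formula. The class and function names are hypothetical. The sample inputs are the 50-processor values quoted in Section 5.5.1.

```python
from dataclasses import dataclass


@dataclass
class ModelInputs:
    P: int      # number of processors
    L: int      # number of levels in the search structure
    H: float    # average number of hops to reach a leaf
    N_a: float  # average actions per operation (supplied, not derived)
    N_m: float  # average messages per operation (supplied, not derived)
    t_a: float  # time to process an action, seconds
    t_m: float  # processing time to send and receive a message, seconds
    t_t: float  # transmission time of a message, seconds (assumed symbol)


def work_per_operation(m):
    """Processor time consumed by one operation: N_a*t_a + N_m*t_m."""
    return m.N_a * m.t_a + m.N_m * m.t_m


def average_job_time(m):
    """Average time to process a job, as in equation (5.4)."""
    return work_per_operation(m) / (m.N_a + m.N_m)


def max_throughput(m):
    """Maximum throughput, as in equation (5.8)."""
    return m.P / work_per_operation(m)


def utilization(m, lam):
    """Processor utilization for a per-processor operation rate lam, as in (5.5)."""
    return lam * work_per_operation(m)


def waiting_time(m, lam):
    """M/M/1 queueing delay (assumed form): rho * t_avg / (1 - rho)."""
    rho = utilization(m, lam)
    if rho >= 1.0:
        raise ValueError("offered load saturates the processors")
    return rho * average_job_time(m) / (1.0 - rho)


def response_time(m, lam):
    """L action steps plus H + 1 message steps, each preceded by a wait."""
    w = waiting_time(m, lam)
    return m.L * (w + m.t_a) + (m.H + 1) * (w + m.t_t + m.t_m)


m50 = ModelInputs(P=50, L=4, H=2, N_a=4.018, N_m=3.013,
                  t_a=0.0044, t_m=0.001, t_t=0.0052)
half_load = max_throughput(m50) / (2 * m50.P)  # per-processor rate giving rho = 1/2
print(average_job_time(m50), max_throughput(m50), response_time(m50, half_load))
```

Passing $N_a$ and $N_m$ in directly keeps the sketch independent of how the restructuring terms are counted.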

In a distributed search structure with a large number of processors, the overhead
of maintaining the search structure is primarily due to the number of hops, $H$, and
the cost of maintaining the level 2 nodes. As we saw in Section 5.3.1, $H$ approaches
an asymptote for a fixed-height tree. The algorithms described in [36] require $R_2$
actions for every split of a level 1 node. Fortunately, we found that $R_2$ grows very
slowly with increasing $P$. As a result, the overhead of maintaining a dB-tree does
not increase as fast as the processing power of the system increases when processors
are added. As a result, the dB-tree algorithm is scalable to a very large number of
processors.
5.5.1 An Application

• Analysis for an 8 processor dB-tree:

Let us make an analysis of a dB-tree distributed over 8 processors for which
we have performed the timing experiments. So, we have $P = 8$ and average
fanout $f = 10$. In Section 5.3.1, we saw that in a large-fanout dB-tree with 4
levels, the number of hops is about 2, and the width of replication on level 2 is
about $1.808 + .0243 \cdot P$, where $P$ is the number of processors. We have found
that the level 3 nodes are replicated at nearly half the number of processors, so
we will assume that $R_3 = P/2$. We measured the time to process a message as
$t_a = .0044$ seconds and transmission time for a message as $t_t = .0052$ seconds
(table 5.10).

With these statistics in mind, we will use the following additional parameters:

$t_m = .001$

$q_i = .1$

$p_{rs} = 1/f = .1$

We use these parameters to determine the number of messages and actions that
an operation generates.

$N_a = 4.063$

$N_m = 3.040$

We can use the estimates of the number of actions and messages to compute
the average execution time and the maximum throughput:

$t_{avg} = .0029$

$Th_{max} = 382$

With a processing rate of 191 operations per second, $\rho = 1/2$, and the response
time for an operation is .0565 seconds. From the chart 5.58, we see that

pessimistic value taking into account some queuing time.
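
As a check on the arithmetic, the quoted response time can be reproduced from equation (5.7). The substitution below assumes $L = 4$ and $H = 2$. It also assumes that at $\rho = 1/2$ the M/M/1 waiting time equals the average job time, so that $W \approx .0029$ seconds. The symbol $t_t$ for the message transmission time is a notational assumption.

$$
\begin{aligned}
T &= L (W + t_a) + (H + 1)(W + t_t + t_m) \\
  &= 4\,(.0029 + .0044) + 3\,(.0029 + .0052 + .001) \\
  &= .0292 + .0273 = .0565 \text{ seconds}
\end{aligned}
$$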
• Analysis for a 50 processor dB-tree:

We will do a similar analysis for larger dB-trees that we used in our simulations
in section 5.3.1 with $P = 50$ and average fanout $f = 40$. From our experiments
we obtained the widths of replication at level 1 as $R_1 = 1$, at level 2, $R_2 = 3.2$,
at level 3, $R_3 = 23.9$ and at level 4, $R_4 = 50$ (5.22). The number of hops is 2
for a large dB-tree with 4 levels.

Again, similar to the analysis for 8 processors, here too we use additional
parameters as input to the model:

$t_a = .0044$

$t_t = .0052$

$t_m = .001$

$q_i = .1$

$p_{rs} = 1/f = .025$

We then compute the number of messages and actions that an operation generates:

$N_a = 4.018$

$N_m = 3.013$


We now calculate the average execution time and maximum throughput:

$t_{avg} = .0029$

$Th_{max} = 2417$


With a processing rate of 1209 operations per second, $\rho \approx 1/2$, and the response
time for an operation is .0060 seconds.


For a comparison, consider the performance of a centralized index server that
has the same message passing cost, $t_m = .001$. Servicing each request requires
the processing of two messages (the request and the response). We will assume

operation requires .0064 seconds, allowing a maximum throughput of 156.25

the response time for an operation is .030 seconds. Therefore, at the cost of
doubled latency, the throughput is increased by a factor of 10 by using the
distributed search structure.
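
The centralized figures can be reproduced with a short calculation. The sketch below assumes that each request costs the server one action of $t_a = .0044$ seconds plus the processing of two messages at $t_m = .001$ seconds each. This decomposition is an inference from the .0064 seconds quoted above and is not stated explicitly in the text. The function names are hypothetical.

```python
def centralized_service_time(t_a, t_m, messages_per_request=2):
    """Server time per request: one action plus the request and response messages."""
    return t_a + messages_per_request * t_m


def centralized_max_throughput(t_a, t_m, messages_per_request=2):
    """A single server saturates at one request per service time."""
    return 1.0 / centralized_service_time(t_a, t_m, messages_per_request)


service = centralized_service_time(0.0044, 0.001)
throughput = centralized_max_throughput(0.0044, 0.001)
print(service, throughput)
```

Because a single server does all of the work, its throughput does not grow with the number of processors, unlike $Th_{max}$ in equation (5.8).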


In this chapter, we have described extensively all the algorithms developed,
experiments conducted and the performance results obtained. Here, a brief summary


Replication: In section 5.2, we have presented two algorithms for replication,
namely, full replication and path replication, and discussed the method of
maintaining replica coherency. To compare the performance of the algorithms we
examined the overhead of the full and path replication of the index nodes and
found that path replication imposes much less overhead in terms of space and
messages than full replication. The width of replication measure for path
replication shows a sublinear increase with number of processors and hence permits
a scalable distributed B-tree.

Data Balancing: We also conducted simulations on a large scale to validate
the results obtained from our implementation. The simulation results show
that our algorithms for data-load-balancing achieve a good data balance among
processors without imposing much overhead. An average node moves only
about .5 times in the entire tree so the load balancing overhead is not high.
The centralized and distributed data balancers perform equally well, with the
distributed algorithm using sequential probing achieving a good balance keeping

- Incremental Growth Data Balancing: The results of this experiment are
similar to those of the general algorithms with the width of replication
being 1.7 and the number of hops around 2.4 for 50 processors.

- Fixed-Height Data Balancing: We performed experiments with fixed-height
trees of 3, 4 and 5, with fanouts varying from 10 to 40 and the number of
processors varying from 10 to 50. With all of them we noticed that the
width of replication reaches a plateau with increasing fanout. For example,
from table 5.2 we see that with a fanout of 40 and with 50 processors,
in a tree of height 4 the width of replication at level 2 is 3.23 and the
width of replication over all levels is 3.5 and the number of hops is 1.99.
This is in accordance with the formula we have derived for the width of
replication at level 2 which is:
width of replication at level 2 = 1.808 + .0243 * P, where P is the number
of processors.

We also notice that the width of replication and number of hops depend
only on the number of processors. The fixed-height dB-tree experiments
show that our algorithms are suitable for larger trees with a large fanout.
Thus, all our algorithms make the B-tree scalable.
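
The slow growth of the level 2 width of replication can be illustrated by evaluating the fitted formula over the range of processors used in the experiments. The sketch below uses only the coefficients quoted in the text. The function name is hypothetical, and the formula is a fit, so its values need not coincide exactly with individual table entries.

```python
def width_of_replication_level2(num_processors):
    """Fitted width of replication at level 2: 1.808 + .0243 * P."""
    return 1.808 + 0.0243 * num_processors


for p in (10, 20, 30, 40, 50):
    r2 = width_of_replication_level2(p)
    # The ratio r2 / p shrinks as processors are added, so the per-processor
    # share of level 2 maintenance work falls as the system grows.
    print(f"P={p:3d}  R2={r2:.3f}  R2/P={r2 / p:.3f}")
```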

the number of keys held by a processor. We first compared two algorithms,

number of extents in the dE-tree. Increasing the size of the dE-tree did not
overly reduce the number of extents even with the merge algorithm, so we made
some extensions by changing the input parameters to the dE-tree. We varied
the initial number of keys and the increment added to a processor when it runs
short. We found that the number of extents varies linearly with the number of
times the increment is performed, with the number of leaves being large when
the increment is small (table 5.5). We then developed the aggressive merge
algorithm, where we settle for sending as many keys as needed. We found
that of the three the aggressive merge is the best, reducing the space overhead
(interior nodes and leaves) to a minimum. The asymptotic number of leaves in
a dE-tree using the aggressive merge algorithm is about n(n - 1)/2, which is
typically much smaller than the number of leaves in a dB-tree.

Timing Study: We performed timing experiments on our implementation to
gather some idea on the system response times for search and insert operations.
We found that with 5 processors the system response time was about 45
milliseconds. Using table 5.10 we account for the timings we have obtained.

Analytical Performance Model: We used the characteristics of the large scale
dB-tree to develop a simple analytical performance model. We then studied
the effect of increasing the number of processors, and found the overhead of
maintaining the dB-tree grows very slowly. We applied the performance model
to analyse the results obtained from our experiments on the dB-tree with 8
processors and 50 processors. With both analyses we found that the model predicts
a slightly larger response time than what we obtained by our experiments. Our
experiments provided a response time of 50 milliseconds (figure 5.58) while the
model predicted 56 milliseconds. Finally, from the model the observation was
that the overhead of maintaining a dB-tree is not significantly affected by the
node fanout as long as the fanout is large. We found that a distributed search
structure permits a much larger throughput than a centralized index server, at
the cost of a modestly increased response time.


CHAPTER  6 
CONCLUSIONS 

In this dissertation we have worked on distributed B-trees. Our contribution to
this has been the development and implementation of several algorithms for data
balancing a distributed replicated B-tree. In Chapter 1 we presented the goal of
our work and provided the motivation for pursuing this research. We also presented
some background on distributed data structures. We selected the B-tree because of
its flexibility as a distributed structure.

In Chapter 2 we discussed concurrent B-trees and distributed B-trees. We also
presented useful applications of the distributed B-tree, namely the distributed extent
tree, the dE-tree, and its usefulness for parallel striped file systems.

In Chapter 3 we presented the theoretical framework of the replication algorithms
developed by us. We presented two approaches, fixed-position copies and variable
copies. We also implemented these algorithms and they are termed full replication
and path replication for the purposes of our implementation.

and details on the node migration mechanism that is fundamental to data balancing.
We also presented the negotiation protocol that is inherent in our data balancing

discussed. We have also studied the portability of our implementation by porting it
to the KSR, a shared memory multiprocessor system with 96 processors.

Finally, in Chapter 5 we presented all the algorithms that we have developed, for
replication and data balancing. We discussed the algorithms and their performance

The performance results of the replication algorithms show that among the two
methods of replication, path replication performs better and is suitable for scaling
to large trees. It incurs much less overhead than full replication. The width of

suitable for scaling to a large number of processors. So, we took this approach and
performed simulations for data balancing on large distributed B-trees.

We developed centralized and distributed algorithms for data balancing and
observed that distributed algorithms with sequential probing perform very well
compared to the others, in terms of the number of probes and moves. On an average

algorithms achieve a good data balance while incurring very little overhead. We also
presented the results of two different scenarios of incremental growth data balancing
and fixed height tree data balancing. The incremental growth performance shows
patterns similar to the generalized algorithms with the width of replication being
around 1.7 and the number of hops around 2.4 for 50 processors. We performed
experiments for fixed height trees of 3, 4 and 5 with fanout varying from 10 to 40
and the number of processors varying from 10 to 50. We observed that the width
of replication varies with the number of processors only, while quickly reaching a
plateau with increasing fanout. The fixed-height dB-tree experiments show that our
algorithms are suitable for larger trees with a large fanout.

We simulated the distributed extent tree, the dE-tree, and performed data
balancing on it. We developed three algorithms for balancing, namely, random, merge and
aggressive merge. Of these the aggressive merge algorithm does the best in achieving
a good data balance with negligible overhead. The asymptotic number of leaves in a
dE-tree using the aggressive merge algorithm is about n(n - 1)/2, which is typically
much smaller than the number of leaves in a dB-tree. With the merge algorithm,

noted that the dE-tree performance was greatly affected by the increment size that
is used to add storage to a processor when it runs short.

In order to determine how well our implementation works, we performed timing

obtained a response time of 50 milliseconds. We have also provided an explanation
of the timings we obtained (table 5.10).

Lastly,  we  presented  an  analytical  model  to  validate  our  experimental  results.  We 

model  predicts  a more  pessimistic  time  than  the  timing  we  obtained. 